Posted to dev@trafficserver.apache.org by Walt Karas <wk...@verizonmedia.com.INVALID> on 2019/09/30 17:20:02 UTC

Tentative proposal: TS mutexes should be queues not OS mutexes

If a Continuation is scheduled, but its mutex is locked, it's put in a
queue specific to that mutex.  The release function for the mutex (called
when a Continuation holding the mutex exits) would move the Continuation at
the front of the mutex's queue (if the queue is not empty) into the
ready-to-run queue (transferring the lock to that Continuation).  A drawback
is that the queue would itself need a mutex (spinlock?), but the critical
section would be very short.
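
For concreteness, here is a minimal sketch of what such a queue-based mutex
could look like.  Everything here (QueueMutex, ready_to_run, the Continuation
stub) is invented for illustration and is not existing ATS code:

#include <deque>
#include <mutex>

struct Continuation { /* stand-in for the ATS Continuation type */ };

// Hypothetical hook into the event system: make a parked Continuation runnable.
void ready_to_run(Continuation *) { /* enqueue on the core's ready-to-run queue */ }

class QueueMutex
{
public:
  // Returns true if the lock was acquired.  Otherwise the Continuation is
  // parked on this mutex and will be handed the lock on release.
  bool lock_or_park(Continuation *c)
  {
    std::lock_guard<std::mutex> guard(queue_lock);  // very short critical section
    if (!locked) {
      locked = true;
      return true;
    }
    waiters.push_back(c);
    return false;
  }

  // Called when the Continuation holding the mutex exits its handler.
  void release()
  {
    Continuation *next = nullptr;
    {
      std::lock_guard<std::mutex> guard(queue_lock);
      if (waiters.empty()) {
        locked = false;          // nothing waiting, the lock becomes free
      } else {
        next = waiters.front();  // the lock is transferred; it stays "locked"
        waiters.pop_front();
      }
    }
    if (next) {
      ready_to_run(next);        // the parked Continuation is now ready to run
    }
  }

private:
  std::mutex queue_lock;         // protects the two fields below
  bool locked = false;
  std::deque<Continuation *> waiters;
};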

There would be a function to lock a mutex directly.  It would create a
Continuation that had two condition variables.  It would assign the mutex
to this Continuation and schedule it.  (In this case, it might make sense
to put this Continuation at the front of the mutex's queue, since it would
be blocking an entire event thread.)  The direct-lock function would then
block on the first condition variable.  When the Continuation ran, it would
trigger the first condition variable, and then block on the second
condition variable.  The direct-lock function would then exit, allowing the
calling code to enter its critical section.  At the end of the critical
section, another function to release the direct lock would be called.  It
would trigger the second condition variable, which would cause the function
of the Continuation created for the direct lock to exit (thus releasing the
mutex).
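
Here is a rough, self-contained sketch of that two-condition-variable
handshake.  In ATS the proxy would be a Continuation scheduled under the
target mutex; a plain std::thread stands in for the event thread here, and
all of the names are made up for the example:

#include <condition_variable>
#include <mutex>
#include <thread>

class DirectLock
{
public:
  // Blocks until the proxy "Continuation" is running, i.e. until the target
  // mutex is effectively held on the caller's behalf.
  void lock()
  {
    proxy = std::thread([this] { proxy_handler(); });
    std::unique_lock<std::mutex> ul(m);
    cv_acquired.wait(ul, [this] { return acquired; });
  }

  // Ends the caller's critical section: lets the proxy handler return, which
  // in the real system would release the target mutex.
  void unlock()
  {
    {
      std::lock_guard<std::mutex> guard(m);
      released = true;
    }
    cv_released.notify_one();
    proxy.join();
  }

private:
  // Runs in place of the proxy Continuation, with the target mutex notionally held.
  void proxy_handler()
  {
    {
      std::lock_guard<std::mutex> guard(m);
      acquired = true;
    }
    cv_acquired.notify_one();                           // first condition variable
    std::unique_lock<std::mutex> ul(m);
    cv_released.wait(ul, [this] { return released; });  // second condition variable
    // returning here is what would release the mutex in the event system
  }

  std::thread proxy;
  std::mutex m;                         // protects the two flags below
  std::condition_variable cv_acquired;  // signaled once the proxy is running
  std::condition_variable cv_released;  // signaled when the critical section ends
  bool acquired = false;
  bool released = false;
};

A caller would then bracket its critical section with lock() and unlock() on
one of these objects.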

With this approach, I'm not sure thread affinities would be of any value.
I think perhaps each core should have its own list of ready-to-run
Continuations, and a pool of event threads with affinity to that core.  Not
having per-event-thread ready-to-run lists means that a Continuation
function that blocks is less likely to block other ready-to-run
Continuations.  If Continuations had core affinities to some degree, this
might reduce evictions in the per-core memory caches.  (Multiple
Continuations having the same function should have the same core affinity.)
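
As a toy sketch of that last point (not how ATS assigns anything today), the
core could be derived from the Continuation's handler, so Continuations
sharing a handler always land on the same core's run queue.  The handler
signature and num_cores parameter are placeholders:

#include <cstdint>

// Hypothetical: derive a core index from a Continuation's handler function, so
// that Continuations sharing a handler stay on one core and keep their working
// set warm in that core's cache.
using Handler = int (*)(int event, void *data);

unsigned core_for_handler(Handler handler, unsigned num_cores)
{
  // Casting a function pointer to an integer is implementation-specific but
  // fine on the platforms ATS targets; low bits are dropped since function
  // addresses tend to be aligned.
  auto bits = reinterpret_cast<std::uintptr_t>(handler);
  return static_cast<unsigned>((bits >> 4) % num_cores);
}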

Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Kees Spoelstra <ks...@we-amp.com>.
Sounds very interesting.
But what is the problem we're trying to solve here?  I like the thread
affinity because it gives us headache-free concurrency in some cases, and
I'll bet that there is some code which doesn't have the proper continuation
mutexes because we know it runs on the same thread.  And often we're trying
to get out of recursion, or prevent recursion, by adding to the continuation
queue (on the same thread).

Are we seeing a lot of imbalanced threads (too much processing causing long
queues of continuations, which I can imagine in some cases)?  And shouldn't
we balance based on transactions or connections, moving those around when we
see imbalance, and aim for embarrassingly parallel processing? :) Come
to think of it, this might introduce another set of problems: how to know
which continuations are part of the life cycle of a connection :/

Jumping threads in one transaction is not always ideal either; this can
really hurt performance. But your proposed model seems to handle that
somewhat better than the current implementation.

Very interested and wondering what this would mean for plugin developers.

Apologies if I have posted this twice

On Tue, 1 Oct 2019 at 01:20, Walt Karas <wk...@verizonmedia.com.invalid>
wrote:

> If a Continuation is scheduled, but its mutex is locked, it's put in a
> queue specific to that mutex.  The release function for the mutex (called
> when a Continuation holding the mutex exits) would put the Continuation at
> the front of the mutex's queue (if not empty) into the ready-to-run queue
> (transferring the lock to that Continuation).  A drawback is that the queue
> would itself need a mutex (spinlock?), but the critical section would be
> very short.
>
> There would be a function to lock a mutex directly.  It would create a
> Continuation that had two condition variables.  It would assign the mutex
> to this Continuation and schedule it.  (In this case, it might make sense
> to put this Continuation at the front of the mutex's queue, since it would
> be blocking an entire event thread.)  The direct-lock function would then
> block on the first condition variable.  When the Continuation ran, it would
> trigger the first condition variable, and then block on the second
> condition variable.  The direct-lock function would then exit, allowing the
> calling code to enter its critical section.  At the end of the critical
> section, another function to release the direct lock would be called.  It
> would trigger the second condition variable, which would cause the function
> of the Continuation created for the direct lock to exit (thus releasing the
> mutex).
>
> With this approach, I'm not sure thread affinities would be of any value.
> I think perhaps each core should have its own list of ready-to-run
> Continuations, and a pool of event threads with affinity to that core.  Not
> having per-event-thread ready-to-run lists means that a Continuation
> function that blocks is less likely to block other ready-to-run
> Continuations.  If Continuations had core affinities to some degree, this
> might reduce evictions in per-core memory cache.  (Multiple Continuations
> having the same function should have the same core affinity.)
>

Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Walt Karas <wk...@verizonmedia.com.INVALID>.
The thing with profiling is trying to cover all the important traffic mixes
and plugin use (impossible because of proprietary plugins).  It seems
desirable to have a design that's robust even when Continuations block more
than they ideally should.  Having a single run queue per core and N worker
event threads per core helps greatly with preventing a blocked Continuation
from blocking other Continuations from starting (which would otherwise
increase worst-case latency).  Each core could have a supervisor thread
running this pseudo-code:

bool idle[Num_workers_per_core];
Condition_variable go[Num_workers_per_core];
Continuation *curr[Num_workers_per_core];

while (true)
{
  blocking get on message queue;
  switch (msg.type)
  {
    case WORKER_THREAD_READY_FOR_NEXT:  // Sent by worker thread.
      if (run queue empty) {
        idle[msg.worker_index] = true;
      } else {
        curr[msg.worker_index] = dequeue from run queue;
        trigger go[msg.worker_index];
      }
    break;

    case QUEUE_CONTINUATION_TO_RUN:
      run_index = index where idle[index] is true or none;
      if (run_index == none) {
        enqueue msg.continuation in run queue;
      } else {
      idle[run_index] = false;
        curr[run_index] = msg.continuation;
        trigger go[run_index];
      }
    break;

  } // end switch
} // end while
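
To make the dispatch logic concrete, here is a rough, self-contained C++
rendering of that pseudo-code.  All of the names (CoreSupervisor, Msg, post,
wait_for_work) are invented for the sketch, and the worker side is reduced to
a single blocking call:

#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

struct Continuation { /* stand-in for the ATS Continuation type */ };

// Message types the per-core supervisor reacts to, as in the pseudo-code.
enum class MsgType { WORKER_THREAD_READY_FOR_NEXT, QUEUE_CONTINUATION_TO_RUN };

struct Msg {
  MsgType type;
  std::size_t worker_index   = 0;        // set for WORKER_THREAD_READY_FOR_NEXT
  Continuation *continuation = nullptr;  // set for QUEUE_CONTINUATION_TO_RUN
};

class CoreSupervisor
{
public:
  explicit CoreSupervisor(std::size_t num_workers)
    : idle(num_workers, false), go(num_workers), curr(num_workers, nullptr) {}

  // Called by workers ("I'm free") and by schedulers ("run this").
  void post(const Msg &msg)
  {
    {
      std::lock_guard<std::mutex> guard(mtx);
      inbox.push_back(msg);
    }
    inbox_cv.notify_one();
  }

  // A worker calls this to sleep until the supervisor hands it a Continuation.
  Continuation *wait_for_work(std::size_t worker_index)
  {
    std::unique_lock<std::mutex> ul(mtx);
    go[worker_index].wait(ul, [&] { return curr[worker_index] != nullptr; });
    Continuation *c = curr[worker_index];
    curr[worker_index] = nullptr;
    return c;
  }

  // One iteration of the supervisor's while (true) loop.
  void run_once()
  {
    std::unique_lock<std::mutex> ul(mtx);
    inbox_cv.wait(ul, [&] { return !inbox.empty(); });  // blocking get on message queue
    Msg msg = inbox.front();
    inbox.pop_front();

    switch (msg.type) {
    case MsgType::WORKER_THREAD_READY_FOR_NEXT:
      if (run_queue.empty()) {
        idle[msg.worker_index] = true;
      } else {
        curr[msg.worker_index] = run_queue.front();
        run_queue.pop_front();
        go[msg.worker_index].notify_one();
      }
      break;

    case MsgType::QUEUE_CONTINUATION_TO_RUN: {
      auto it = std::find(idle.begin(), idle.end(), true);
      if (it == idle.end()) {
        run_queue.push_back(msg.continuation);  // no idle worker: queue it
      } else {
        std::size_t run_index = static_cast<std::size_t>(it - idle.begin());
        idle[run_index] = false;
        curr[run_index] = msg.continuation;
        go[run_index].notify_one();
      }
      break;
    }
    }
  }

private:
  std::mutex mtx;                        // guards everything below
  std::condition_variable inbox_cv;
  std::deque<Msg> inbox;                 // the supervisor's message queue
  std::deque<Continuation *> run_queue;  // core-wide ready-to-run Continuations
  std::vector<bool> idle;                // idle[i]: worker i is waiting for work
  std::vector<std::condition_variable> go;
  std::vector<Continuation *> curr;      // curr[i]: work handed to worker i
};

A worker's loop would then be: post a WORKER_THREAD_READY_FOR_NEXT message
with its index, run whatever wait_for_work() returns, and repeat.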

On Wed, Oct 2, 2019 at 12:23 PM Walt Karas <wk...@verizonmedia.com> wrote:

> What's the best tool for multi-threaded profiling on Linux?
>
> On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll
> <so...@verizonmedia.com.invalid> wrote:
>
>> Correct, it doesn't mean no lock contention. The claim was it reduced the
>> lock contention to a level where it's not significant enough to warrant
>> additional preventative measures. The data Dan got wasn't from code
>> analysis, but from run time profiling. That was a while ago so if you'd
>> like to perform another pass of measuring the level of lock contention,
>> that would certainly be interesting data.
>>
>> In addition, the push for thread affinity started from actual issues in
>> production with Continuations being scheduled on different threads of the
>> same type (that is, it was Kees' fault). Those would not be resolved by
>> faster scheduling on different threads.
>>
>> On Tue, Oct 1, 2019 at 11:49 AM Walt Karas <wkaras@verizonmedia.com
>> .invalid>
>> wrote:
>>
>> > I assume thread affinity can't literally mean no lock contention.  You'd
>> need
>> > a lock on the thread run queue wouldn't you?  Continuations can't only
>> get
>> > queued for the event thread from the event thread.  I don't think we can
>> > say conclusively that there would be a significant difference due to
>> lock
>> > contention.  I'm guessing that Fei would agree that the Continuation
>> > dispatch code is difficult to understand and work on.  Simplification
>> and
>> > more modularity is obviously a goal.  Seems like it would be simpler if
>> all
>> > the Continuations in a to-run list were actually ready to run.
>> >
>> > On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll
>> > <so...@verizonmedia.com.invalid> wrote:
>> >
>> > > Do you have any more specific information on mutex contention? We
>> have in
>> > > fact already looked at doing this, I think back in 2015 with Dan Xu.
>> The
>> > > goal there was to have queues with the mutexes to avoid rescheduling.
>> As
>> > a
>> > > precursor Dan did some profiling and the only significant lock
>> contention
>> > > he could find was in the cache. That led to the partial object
>> caching
>> > > work setting up queues for the hot locks, but it was decided that
>> given
>> > the
>> > > lack of
>> > > contention elsewhere, it wasn't worth the complexity.
>> > >
>> > > I think thread affinity is a better choice because no lock contention
>> > will
>> > > always beat even the most optimized lock contention resolution. If
>> > > Continuations related to the same constellation of data objects are on
>> > the
>> > > same thread, then the locks are never contested, which makes it as
>> fast
>> > as
>> > > possible.
>> > >
>> > > On Mon, Sep 30, 2019 at 3:45 PM Walt Karas <wkaras@verizonmedia.com
>> > > .invalid>
>> > > wrote:
>> > >
>> > > > From the longer-term TSers I've heard comments about seeing
>> profiling
>> > > > results that show that waiting on mutexes is a significant
>> performance
>> > > > issue with TS.  But I'm not aware of any write-ups of such results.
>> > > > Unfortunately, I'm relatively new to TS and Linux, so I'm not
>> currently
>> > > > familiar with the best approaches to profiling TS.
>> > > >
>> > > > For better performance, I think that having a single to-run
>> > Continuation
>> > > > queue, or one per core, with a queue feeding multiple event threads
>> is
>> > > the
>> > > > main thing.  It's more resilient to Continuations that block.  There
>> > > > doesn't seem to be enthusiasm for getting hard-core about not having
>> > > > blocking Continuations (see
>> > > > https://github.com/apache/trafficserver/pull/5412 ).  I'm not sure
>> > > > changing
>> > > > to queue-based mutexes would have a significant performance impact.
>> > But
>> > > it
>> > > > seems a cleaner design, making sure Continuations in the to-run
>> list(s)
>> > > are
>> > > > actually ready to run.  But a different mutex implementation is not
>> > > > strictly necessary in order to consolidate to-run Continuation
>> queues.
>> > > >
>> > > > On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra <
>> kspoelstra@we-amp.com>
>> > > > wrote:
>> > > >
>> > > > > Sounds very interesting.
>> > > > > But what is the problem we're trying to solve here, I like the
>> thread
>> > > > > affinity because it gives us head ache free concurrency in some
>> > cases,
>> > > > and
>> > > > > I'll bet that there is some code which doesn't have the proper
>> > > > continuation
>> > > > > mutexes because we know it runs on the same thread.
>> > > > >
>> > > > > Are we seeing a lot of imbalanced threads (too much processing
>> > causing
>> > > > long
>> > > > > queues of continuations, which I can imagine in some cases) ? And
>> > > > shouldn't
>> > > > > we balance based on transactions or connections, move those around
>> > when
>> > > > we
>> > > > > see imbalance and aim for embarrassingly parallel processing :)
>> Come
>> > > > > to think of it, this might introduce another set of problems, how
>> to
>> > > know
>> > > > > which continuations are part of the life cycle of a connection :/
>> > > > >
>> > > > > Jumping threads in one transaction is not always ideal either,
>> this
>> > can
>> > > > > really hurt performance. But your proposed model seems to handle
>> that
>> > > > > somewhat better than the current implementation.
>> > > > >
>> > > > > Very interested and wondering what this would mean for plugin
>> > > developers.
>> > > > >
>> > > > > On Mon, 30 Sep 2019, 19:20 Walt Karas, <wkaras@verizonmedia.com
>> > > .invalid>
>> > > > > wrote:
>> > > > >
>> > > > > > If a Continuation is scheduled, but its mutex is locked, it's
>> put
>> > in
>> > > a
>> > > > > > queue specific to that mutex.  The release function for the
>> mutex
>> > > > (called
>> > > > > > when a Continuation holding the mutex exists) would put the
>> > > > Continuation
>> > > > > at
>> > > > > > the front of the mutex's queue (if not empty) into the
>> ready-to-run
>> > > > queue
>> > > > > > (transferring the lock to that Continuation).  A drawback is
>> that
>> > the
>> > > > > queue
>> > > > > > would itself need a mutex (spinlock?), but the critical section
>> > would
>> > > > be
>> > > > > > very short.
>> > > > > >
>> > > > > > There would be a function to lock a mutex directly.  It would
>> > create
>> > > a
>> > > > > > Continuation that had two condition variables.  It would assign
>> the
>> > > > mutex
>> > > > > > to this Continuation and schedule it.  (In this case, it might
>> make
>> > > > sense
>> > > > > > to put this Continuation at the front of the mutex's queue,
>> since
>> > it
>> > > > > would
>> > > > > > be blocking an entire event thread.)  The direct-lock function
>> > would
>> > > > then
>> > > > > > block on the first condition variable.  When the Continuation
>> ran,
>> > it
>> > > > > would
>> > > > > > trigger the first condition variable, and then block on the
>> second
>> > > > > > condition variable.  The direct-lock function would then exit,
>> > > allowing
>> > > > > the
>> > > > > > calling code to enter its critical section.  At the end of the
>> > > critical
>> > > > > > section, another function to release the direct lock would be
>> > called.
>> > > > It
>> > > > > > would trigger the second condition variable, which would cause
>> the
>> > > > > function
>> > > > > > of the Continuation created for the direct lock to exit (thus
>> > > releasing
>> > > > > the
>> > > > > > mutex).
>> > > > > >
>> > > > > > With this approach, I'm not sure thread affinities would be of
>> any
>> > > > value.
>> > > > > > I think perhaps each core should have it's own list of
>> ready-to-run
>> > > > > > Continuations, and a pool of event threads with affinity to that
>> > > core.
>> > > > > Not
>> > > > > > having per-event-thread ready-to-run lists means that a
>> > Continuation
>> > > > > > function that blocks is less likely to block other ready-to-run
>> > > > > > Continuations.  If Continuations had core affinities to some
>> > degree,
>> > > > this
>> > > > > > might reduce evictions in per-core memory cache.  (Multiple
>> > > > Continuations
>> > > > > > having the same function should have the same core affinity.)
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Alan Carroll <so...@verizonmedia.com.INVALID>.
It's reasonably likely this was discussed on the mailing list the last time
around, probably in the 2015-2016 time frame. Alternatively it might have
been a JIRA ticket on the old ASF Jira.

On Thu, Oct 3, 2019 at 9:57 AM Walt Karas <wk...@verizonmedia.com.invalid>
wrote:

> Thread affinity seems like a worthwhile incremental improvement for the
> current approach.  I'm just offering alternative approaches for
> consideration.
>
> Were any measurements sent to this mailing list or recorded somewhere else?
>
> On Thu, Oct 3, 2019 at 9:14 AM Alan Carroll
> <so...@verizonmedia.com.invalid> wrote:
>
> > I didn't see the thread affinity as a rathole, but more as a
> regularization
> > of existing practice. That kind of thing was already being done in many
> > places in an ad hoc manner (e.g. cache, DNS, VC migration). Putting it in
> > the Thread support classes meant there was a single implementation, not
> > several, with the consequent and highly desired consistency of operation.
> >
> > I also considered it very desirable for plugins to be able to do similar
> > things, which is why I championed Kees' request on that. Any plugin doing
> > asynchronous operations found it difficult to impossible to do the
> correct
> > thing. Now it's easy and mostly automatic. As for tests on that, we
> > certainly had mysterious crashes and other problems due to the lack of
> such
> > a mechanism and even now we're still cleaning up plugins that do bad
> things
> > with scheduling Continuations. The difference now is we can put in a
> clean
> > fix rather than obscure hacks. This went through because it addressed, as
> > Kees noted, a well known and chronic issue in a reasonable way,
> formalizing
> > what had been unwritten rules.
> >
> > Let me note again, we considered this previously, did actual
> measurements,
> > and concluded the gains were much too small to justify the code (except
> in
> > the cache, which does have hot locks). And as noted above, given that we
> > were already doing crude forms of thread affinity, it seemed making that
> > more general and consistent was the best approach to reducing lock
> > contention, especially as it provided other benefits.
> >
> > On Wed, Oct 2, 2019 at 4:30 PM Walt Karas <wkaras@verizonmedia.com
> > .invalid>
> > wrote:
> >
> > > On Wed, Oct 2, 2019 at 3:19 PM Kees Spoelstra <ks...@we-amp.com>
> > > wrote:
> > >
> > > > I think we're all able to dig some ratholes with less code than 30
> > lines.
> > > >
> > >
> > > Hmm, I thought "rathole" implied unneeded complexity but it seems to be
> > > just a synonym for "bad".
> > >
> > >
> > > >
> > > > Alan, it's only my fault that I pushed for threadaffinity for plugins
> > > > after I saw how it was in core... Can't keep all the goodies for the
> > core
> > > > :)
> > > >
> > > > One thing to watch out for is in order execution of continuations, as
> > > I've
> > > > mentioned earlier there is a big possibility that there is code which
> > > does
> > > > not properly lock ( or does not need to) and might get executing
> > > > continuations at the same time instead of in order when running
> > > > concurrently.
> > > > Sometimes to get out of the current executing context which might be
> > deep
> > > > in the stack (x locks down and before some cleanup) you will schedule
> > > > another continuation on the same thread, which can give you a cleaner
> > > > context.
> > > > Imagine the scheduler picking that up and running it concurrently on
> > > > another thread on the same core, that might yield weird results.
> > > > Multiple threads might break that implicit thread safety
> > > >
> > >
> > > When one Continuation schedules another, it could go into a queue of
> > > Continuations that become runnable after the current Continuation
> exits.
> > > So the pseudo-code becomes:
> > >
> > > bool idle[Num_workers_per_core];
> > > Condition_variable go[Num_workers_per_core];
> > > Continuation *curr[Num_workers_per_core];
> > > Continuation_queue after[Num_workers_per_core]; // Run after current
> > > continuation.
> > >
> > > while (true)
> > > {
> > >   blocking get on message queue;
> > >   switch (msg.type)
> > >   {
> > >     case WORKER_THREAD_READY_FOR_NEXT:  // Sent by worker thread.
> > >       while (after[msg.worker_index] is not empty) {
> > >         c = pop from after[msg.worker_index];
> > >         push c to run queue;
> > >       }
> > >       if (run queue empty) {
> > >         idle[msg.worker_index] = true;
> > >       } else {
> > >         curr[msg.worker_index] = dequeue from run queue;
> > >         trigger go[msg.worker_index];
> > >       }
> > >     break;
> > >
> > >     case QUEUE_CONTINUATION_TO_RUN:
> > >       run_index = index where idle[index] is true or none;
> > >       if (run_index == none) {
> > >         enqueue msg.continuation in run queue;
> > >       } else {
> > >         idle[run_index] = false;
> > >         curr[run_index] = msg.continuation;
> > >         trigger go[run_index];
> > >       }
> > >     break;
> > >
> > >   } // end switch
> > > } // end while
> > >
> > > On the other hand, it's not that hard for a specific Continuation to
> not
> > > trigger successive Continuations until they are actually ready to run.
> > It
> > > would simply need to have its own "after" queue.  We should be
> selective
> > > about how many capabilities we pull into the core Event execution code,
> > and
> > > how much we damage general behavior by doing so.
> > >
> > >
> > > >
> > > > Locking mostly gets/got screwy in the plugins, because of possible
> > > > thread/core jumping, the plugin affinity functions help a lot with
> > that.
> > > > There are patterns to prevent that, try lock, but they're a bit of a
> > pain
> > > > and I've seen people ignore that.
> > > >
> > > > Some findings we had when doing h2. Thread affinity might not always
> > > yield
> > > > the best latency on an almost idle system if the protocols on both
> > sides
> > > > are relatively heavy weight the packing and unpacking will take time
> > and
> > > > one can imagine there is some parallelism which could decrease
> latency.
> > > > We've thought about controlled handover to other threads, but never
> > made
> > > > any work of it.
> > > >
> > > > I can imagine with quic coming that we might see some thread jumping
> > > > again, haven't been following progress on that one.
> > > >
> > > > On Wed, 2 Oct 2019, 21:33 Walt Karas, <wkaras@verizonmedia.com
> > .invalid>
> > > > wrote:
> > > >
> > > >> On Wed, Oct 2, 2019 at 2:25 PM Leif Hedstrom <zw...@apache.org>
> > wrote:
> > > >>
> > > >> >
> > > >> >
> > > >> > > On Oct 2, 2019, at 1:00 PM, Walt Karas <wkaras@verizonmedia.com
> > > >> .INVALID>
> > > >> > wrote:
> > > >> > >
> > > >> > > I feel like you overestimate me if you think I could create a
> > > rathole
> > > >> > with
> > > >> > > just 30 lines of pseudo-code.
> > > >> >
> > > >> > We have faith in you.
> > > >> >
> > > >>
> > > >> What's an example of a rathole I've already created?  Seems like a
> > very
> > > >> big
> > > >> leap of faith if there isn't one.
> > > >>
> > > >> Question:  Why do we have more event threads than cores?
> > > >>
> > > >> If it's to mitigate the fact that ATS is not purist event-driven,
> that
> > > >> Continuations sometimes do block, it makes sense to use the best
> > design
> > > to
> > > >> mitigate this.  Sorry but I don't think any big profiling hoedown is
> > > >> needed
> > > >> to see that what I'm proposing mitigates it better.  It's
> essentially
> > > the
> > > >> same idea as having a single line to pay even when there are
> multiple
> > > >> cashiers.
> > > >>
> > > >>
> > > >> >
> > > >> > >
> > > >> > > We're making complex thread affinity changes, on top of an event
> > > >> > execution
> > > >> > > code that already seem complex.  (I leave it to those with more
> > > >> rathole
> > > >> > > background to say if it should be so labeled.) . Can I see the
> > > >> profiling
> > > >> > > results used to justify and guide the thread affinity design?
> > > >> >
> > > >> >
> > > >> > +1.
> > > >> >
> > > >> > — leif
> > > >> >
> > > >> > >
> > > >> > > On Wed, Oct 2, 2019 at 1:32 PM Leif Hedstrom <zw...@apache.org>
> > > >> wrote:
> > > >> > >
> > > >> > >>
> > > >> > >>
> > > >> > >>> On Oct 2, 2019, at 11:23 AM, Walt Karas <
> > wkaras@verizonmedia.com
> > > >> > .INVALID>
> > > >> > >> wrote:
> > > >> > >>>
> > > >> > >>> What's the best tool for multi-threaded profiling on Linux?
> > > >> > >>
> > > >> > >> Probably the Intel tools.  Probably the “perf” toolchain works
> > > well,
> > > >> at
> > > >> > >> least on modern linux, you could do perf lock etc.
> > > >> > >>
> > > >> > >>>
> > > >> > >>> On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll
> > > >> > >>> <so...@verizonmedia.com.invalid> wrote:
> > > >> > >>>
> > > >> > >>>> Correct, it doesn't mean no lock contention. The claim was it
> > > >> reduced
> > > >> > >> the
> > > >> > >>>> lock contention to a level where it's not significant enough
> to
> > > >> > warrant
> > > >> > >>>> additional preventative measures. The data Dan got wasn't
> from
> > > code
> > > >> > >>>> analysis, but from run time profiling. That was a while ago
> so
> > if
> > > >> > you'd
> > > >> > >>>> like to perform another pass of measuring the level of lock
> > > >> > contention,
> > > >> > >>>> that would certainly be interesting data.
> > > >> > >>
> > > >> > >>
>> > >> So, before we dig ourselves into this rathole, can someone
> > explain
> > > >> to me
> > > >> > >> what the problem is? Where are the metrics / analysis showing
> > that
> > > we
> > > >> > have
> > > >> > >> a problem? I’d love to see a comparison too between various
> > version
> > > >> of
> > > >> > ATS,
> > > >> > >> say v6 - v9, and understand where, if anywhere, we made things
> > > (lock
> > > >> > >> contention) worse.
> > > >> > >>
> > > >> > >> We have to stop making complicated and invasive solutions
> without
> > > >> real
> > > >> > >> problems, and understanding such problem.
> > > >> > >>
> > > >> > >> My $0.01,
> > > >> > >>
> > > >> > >> — Leif
> > > >> > >>
> > > >> > >>>>
> > > >> > >>>> In addition, the push for thread affinity started from actual
> > > >> issues
> > > >> > in
> > > >> > >>>> production with Continuations being scheduled on different
> > > threads
> > > >> of
> > > >> > >> the
> > > >> > >>>> same type (that is, it was Kees' fault). Those would not be
> > > >> resolved
> > > >> > by
> > > >> > >>>> faster scheduling on different threads.
> > > >> > >>>>
> > > >> > >>>> On Tue, Oct 1, 2019 at 11:49 AM Walt Karas <
> > > >> wkaras@verizonmedia.com
> > > >> > >>>> .invalid>
> > > >> > >>>> wrote:
> > > >> > >>>>
> > > >> > >>>>> I assume thread affinity can't literal mean no lock
> > contention.
> > > >> > You'd
> > > >> > >>>> need
> > > >> > >>>>> a lock on the thread run queue wouldn't you?  Continuations
> > > can't
> > > >> > only
> > > >> > >>>> get
> > > >> > >>>>> queued for the event thread from the event thread.  I don't
> > > think
> > > >> we
> > > >> > >> can
> > > >> > >>>>> say conclusively that there would be a significant
> difference
> > > due
> > > >> to
> > > >> > >> lock
> > > >> > >>>>> contention.  I'm guessing that Fei would agree that the
> > > >> Continuation
> > > >> > >>>>> dispatch code is difficult to understand and work on.
> > > >> Simplification
> > > >> > >> and
> > > >> > >>>>> more modularity is obviously a goal.  Seems like it would be
> > > >> simpler
> > > >> > if
> > > >> > >>>> all
> > > >> > >>>>> the Continuations in a to-run list where actually ready to
> > run.
> > > >> > >>>>>
> > > >> > >>>>> On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll
> > > >> > >>>>> <so...@verizonmedia.com.invalid> wrote:
> > > >> > >>>>>
> > > >> > >>>>>> Do you have any more specific information on mutex
> > contention?
> > > We
> > > >> > have
> > > >> > >>>> in
> > > >> > >>>>>> fact already looked at doing this, I think back in 2015
> with
> > > Dan
> > > >> Xu.
> > > >> > >>>> The
> > > >> > >>>>>> goal there was to have queues with the mutexes to avoid
> > > >> > rescheduling.
> > > >> > >>>> As
> > > >> > >>>>> a
> > > >> > >>>>>> precursor Dan did some profiling and the only significant
> > lock
> > > >> > >>>> contention
> > > >> > >>>>>> he could find was in the cache. That lead to the partial
> > object
> > > >> > >> caching
> > > >> > >>>>>> work setting up queues for the hot locks, but it was
> decided
> > > that
> > > >> > >> given
> > > >> > >>>>> the
> > > >> > >>>>>> lack of
> > > >> > >>>>>> contention elsewhere, it wasn't worth the complexity.
> > > >> > >>>>>>
> > > >> > >>>>>> I think thread affinity is a better choice because no lock
> > > >> > contention
> > > >> > >>>>> will
> > > >> > >>>>>> always beat even the most optimized lock contention
> > resolution.
> > > >> If
> > > >> > >>>>>> Continuations related to the same constellation of data
> > objects
> > > >> are
> > > >> > on
> > > >> > >>>>> the
> > > >> > >>>>>> same thread, then the locks are never contested, which
> makes
> > it
> > > >> as
> > > >> > >> fast
> > > >> > >>>>> as
> > > >> > >>>>>> possible.
> > > >> > >>>>>>
> > > >> > >>>>>> On Mon, Sep 30, 2019 at 3:45 PM Walt Karas <
> > > >> wkaras@verizonmedia.com
> > > >> > >>>>>> .invalid>
> > > >> > >>>>>> wrote:
> > > >> > >>>>>>
> > > >> > >>>>>>> From the longer-term TSers I've heard comments about
> seeing
> > > >> > profiling
> > > >> > >>>>>>> results that show that waiting on mutexes is a significant
> > > >> > >>>> performance
> > > >> > >>>>>>> issue with TS.  But I'm not aware of any write-ups of such
> > > >> results.
> > > >> > >>>>>>> Unfortunately, I'm relatively new to TS and Linux, so I'm
> > not
> > > >> > >>>> currently
> > > >> > >>>>>>> familiar with the best approaches to profiling TS.
> > > >> > >>>>>>>
> > > >> > >>>>>>> For better performance, I think that having a single
> to-run
> > > >> > >>>>> Continuation
> > > >> > >>>>>>> queue, or one per core, with a queue feeding multiple
> event
> > > >> threads
> > > >> > >>>> is
> > > >> > >>>>>> the
> > > >> > >>>>>>> main thing.  It's more resilient to Continuations that
> > block.
> > > >> > There
> > > >> > >>>>>>> doesn't seem to be enthusiasm for getting hard-core about
> > not
> > > >> > having
> > > >> > >>>>>>> blocking Continuations (see
> > > >> > >>>>>>> https://github.com/apache/trafficserver/pull/5412 ).  I'm
> > not
> > > >> sure
> > > >> > >>>>>>> changing
> > > >> > >>>>>>> to queue-based mutexes would have a significant
> performance
> > > >> impact.
> > > >> > >>>>> But
> > > >> > >>>>>> it
> > > >> > >>>>>>> seems a cleaner design, making sure Continuations in the
> > > to-run
> > > >> > >>>> list(s)
> > > >> > >>>>>> are
> > > >> > >>>>>>> actually ready to run.  But a different mutex
> implementation
> > > is
> > > >> not
> > > >> > >>>>>>> strictly necessary in order to consolidate to-run
> > Continuation
> > > >> > >>>> queues.
> > > >> > >>>>>>>
> > > >> > >>>>>>> On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra <
> > > >> > >>>> kspoelstra@we-amp.com>
> > > >> > >>>>>>> wrote:
> > > >> > >>>>>>>
> > > >> > >>>>>>>> Sounds very interesting.
> > > >> > >>>>>>>> But what is the problem we're trying to solve here, I
> like
> > > the
> > > >> > >>>> thread
> > > >> > >>>>>>>> affinity because it gives us head ache free concurrency
> in
> > > some
> > > >> > >>>>> cases,
> > > >> > >>>>>>> and
> > > >> > >>>>>>>> I'll bet that there is some code which doesn't have the
> > > proper
> > > >> > >>>>>>> continuation
> > > >> > >>>>>>>> mutexes because we know it runs on the same thread.
> > > >> > >>>>>>>>
> > > >> > >>>>>>>> Are we seeing a lot of imbalanced threads (too much
> > > processing
> > > >> > >>>>> causing
> > > >> > >>>>>>> long
> > > >> > >>>>>>>> queues of continuations, which I can imagine in some
> > cases) ?
> > > >> And
> > > >> > >>>>>>> shouldn't
> > > >> > >>>>>>>> we balance based on transactions or connections, move
> those
> > > >> around
> > > >> > >>>>> when
> > > >> > >>>>>>> we
> > > >> > >>>>>>>> see imbalance and aim for embarrassingly parallel
> > processing
> > > :)
> > > >> > >>>> Come
> > > >> > >>>>>>>> to think of it, this might introduce another set of
> > problems,
> > > >> how
> > > >> > >>>> to
> > > >> > >>>>>> know
> > > >> > >>>>>>>> which continuations are part of the life cycle of a
> > > connection
> > > >> :/
> > > >> > >>>>>>>>
> > > >> > >>>>>>>> Jumping threads in one transaction is not always ideal
> > > either,
> > > >> > this
> > > >> > >>>>> can
> > > >> > >>>>>>>> really hurt performance. But your proposed model seems to
> > > >> handle
> > > >> > >>>> that
> > > >> > >>>>>>>> somewhat better than the current implementation.
> > > >> > >>>>>>>>
> > > >> > >>>>>>>> Very interested and wondering what this would mean for
> > plugin
> > > >> > >>>>>> developers.
> > > >> > >>>>>>>>
> > > >> > >>>>>>>> On Mon, 30 Sep 2019, 19:20 Walt Karas, <
> > > >> wkaras@verizonmedia.com
> > > >> > >>>>>> .invalid>
> > > >> > >>>>>>>> wrote:
> > > >> > >>>>>>>>
> > > >> > >>>>>>>>> If a Continuation is scheduled, but its mutex is locked,
> > > it's
> > > >> put
> > > >> > >>>>> in
> > > >> > >>>>>> a
> > > >> > >>>>>>>>> queue specific to that mutex.  The release function for
> > the
> > > >> mutex
> > > >> > >>>>>>> (called
> > > >> > >>>>>>>>> when a Continuation holding the mutex exists) would put
> > the
> > > >> > >>>>>>> Continuation
> > > >> > >>>>>>>> at
> > > >> > >>>>>>>>> the front of the mutex's queue (if not empty) into the
> > > >> > >>>> ready-to-run
> > > >> > >>>>>>> queue
> > > >> > >>>>>>>>> (transferring the lock to that Continuation).  A
> drawback
> > is
> > > >> that
> > > >> > >>>>> the
> > > >> > >>>>>>>> queue
> > > >> > >>>>>>>>> would itself need a mutex (spinlock?), but the critical
> > > >> section
> > > >> > >>>>> would
> > > >> > >>>>>>> be
> > > >> > >>>>>>>>> very short.
> > > >> > >>>>>>>>>
> > > >> > >>>>>>>>> There would be a function to lock a mutex directly.  It
> > > would
> > > >> > >>>>> create
> > > >> > >>>>>> a
> > > >> > >>>>>>>>> Continuation that had two condition variables.  It would
> > > >> assign
> > > >> > >>>> the
> > > >> > >>>>>>> mutex
> > > >> > >>>>>>>>> to this Continuation and schedule it.  (In this case, it
> > > might
> > > >> > >>>> make
> > > >> > >>>>>>> sense
> > > >> > >>>>>>>>> to put this Continuation at the front of the mutex's
> > queue,
> > > >> since
> > > >> > >>>>> it
> > > >> > >>>>>>>> would
> > > >> > >>>>>>>>> be blocking an entire event thread.)  The direct-lock
> > > function
> > > >> > >>>>> would
> > > >> > >>>>>>> then
> > > >> > >>>>>>>>> block on the first condition variable.  When the
> > > Continuation
> > > >> > >>>> ran,
> > > >> > >>>>> it
> > > >> > >>>>>>>> would
> > > >> > >>>>>>>>> trigger the first condition variable, and then block on
> > the
> > > >> > >>>> second
> > > >> > >>>>>>>>> condition variable.  The direct-lock function would then
> > > exit,
> > > >> > >>>>>> allowing
> > > >> > >>>>>>>> the
> > > >> > >>>>>>>>> calling code to enter its critical section.  At the end
> of
> > > the
> > > >> > >>>>>> critical
> > > >> > >>>>>>>>> section, another function to release the direct lock
> would
> > > be
> > > >> > >>>>> called.
> > > >> > >>>>>>> It
> > > >> > >>>>>>>>> would trigger the second condition variable, which would
> > > cause
> > > >> > >>>> the
> > > >> > >>>>>>>> function
> > > >> > >>>>>>>>> of the Continuation created for the direct lock to exit
> > > (thus
> > > >> > >>>>>> releasing
> > > >> > >>>>>>>> the
> > > >> > >>>>>>>>> mutex).
> > > >> > >>>>>>>>>
> > > >> > >>>>>>>>> With this approach, I'm not sure thread affinities would
> > be
> > > of
> > > >> > >>>> any
> > > >> > >>>>>>> value.
> > > >> > >>>>>>>>> I think perhaps each core should have it's own list of
> > > >> > >>>> ready-to-run
> > > >> > >>>>>>>>> Continuations, and a pool of event threads with affinity
> > to
> > > >> that
> > > >> > >>>>>> core.
> > > >> > >>>>>>>> Not
> > > >> > >>>>>>>>> having per-event-thread ready-to-run lists means that a
> > > >> > >>>>> Continuation
> > > >> > >>>>>>>>> function that blocks is less likely to block other
> > > >> ready-to-run
> > > >> > >>>>>>>>> Continuations.  If Continuations had core affinities to
> > some
> > > >> > >>>>> degree,
> > > >> > >>>>>>> this
> > > >> > >>>>>>>>> might reduce evictions in per-core memory cache.
> > (Multiple
> > > >> > >>>>>>> Continuations
> > > >> > >>>>>>>>> having the same function should have the same core
> > > affinity.)
> > > >> > >>>>>>>>>
> > > >> > >>>>>>>>
> > > >> > >>>>>>>
> > > >> > >>>>>>
> > > >> > >>>>>
> > > >> > >>>>
> > > >> > >>
> > > >> > >>
> > > >> >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Walt Karas <wk...@verizonmedia.com.INVALID>.
Thread affinity seems like a worthwhile incremental improvement for the
current approach.  I'm just offering alternative approaches for
consideration.

Were any measurements sent to this mailing list or recorded somewhere else?

On Thu, Oct 3, 2019 at 9:14 AM Alan Carroll
<so...@verizonmedia.com.invalid> wrote:

> I didn't see the thread affinity as a rathole, but more as a regularization
> of existing practice. That kind of thing was already being done in many
> places in an ad hoc manner (e.g. cache, DNS, VC migration). Putting it in
> the Thread support classes meant there was a single implementation, not
> several, with the consequent and highly desired consistency of operation.
>
> I also considered it very desirable for plugins to be able to do similar
> things, which is why I championed Kees' request on that. Any plugin doing
> asynchronous operations found it difficult to impossible to do the correct
> thing. Now it's easy and mostly automatic. As for tests on that, we
> certainly had mysterious crashes and other problems due to the lack of such
> a mechanism and even now we're still cleaning up plugins that do bad things
> with scheduling Continuations. The difference now is we can put in a clean
> fix rather than obscure hacks. This went through because it addressed, as
> Kees noted, a well known and chronic issue in a reasonable way, formalizing
> what had been unwritten rules.
>
> Let me note again, we considered this previously, did actual measurements,
> and concluded the gains were much too small to justify the code (except in
> the cache, which does have hot locks). And as noted above, given that we
> were already doing crude forms of thread affinity, it seemed making that
> more general and consistent was the best approach to reducing lock
> contention, especially as it provided other benefits.
>
> On Wed, Oct 2, 2019 at 4:30 PM Walt Karas <wkaras@verizonmedia.com
> .invalid>
> wrote:
>
> > On Wed, Oct 2, 2019 at 3:19 PM Kees Spoelstra <ks...@we-amp.com>
> > wrote:
> >
> > > I think we're all able to dig some ratholes with less code than 30
> lines.
> > >
> >
> > Hmm, I thought "rathole" implied unneeded complexity but it seems to be
> > just a synonym for "bad".
> >
> >
> > >
> > > Alan, it's only my fault that I pushed for threadaffinity for plugins
> > > after I saw how it was in core... Can't keep all the goodies for the
> core
> > > :)
> > >
> > > One thing to watch out for is in order execution of continuations, as
> > I've
> > > mentioned earlier there is a big possibility that there is code which
> > does
> > > not properly lock ( or does not need to) and might get executing
> > > continuations at the same time instead of in order when running
> > > concurrently.
> > > Sometimes to get out of the current executing context which might be
> deep
> > > in the stack (x locks down and before some cleanup) you will schedule
> > > another continuation on the same thread, which can give you a cleaner
> > > context.
> > > Imagine the scheduler picking that up and running it concurrently on
> > > another thread on the same core, that might yield weird results.
> > > Multiple threads might break that implicit thread safety
> > >
> >
> > When one Continuation schedules another, it could go into a queue of
> > Continuations that become runnable after the current Continuation exits.
> > So the pseudo-code becomes:
> >
> > bool idle[Num_workers_per_core];
> > Condition_variable go[Num_workers_per_core];
> > Continuation *curr[Num_workers_per_core];
> > Continuation_queue after[Num_workers_per_core]; // Run after current
> > continuation.
> >
> > while (true)
> > {
> >   blocking get on message queue;
> >   switch (msg.type)
> >   {
> >     case WORKER_THREAD_READY_FOR_NEXT:  // Sent by worker thread.
> >       while (after[msg.worker_index] is not empty) {
> >         c = pop from after[msg.worker_index];
> >         push c to run queue;
> >       }
> >       if (run queue empty) {
> >         idle[msg.worker_index] = true;
> >       } else {
> >         curr[msg.worker_index] = dequeue from run queue;
> >         trigger go[msg.worker_index];
> >       }
> >     break;
> >
> >     case QUEUE_CONTINUATION_TO_RUN:
> >       run_index = index where idle[index] is true or none;
> >       if (run_index == none) {
> >         enqueue msg.continuation in run queue;
> >       } else {
> >         idle[run_index] = false;
> >         curr[run_index] = msg.continuation;
> >         trigger go[run_index];
> >       }
> >     break;
> >
> >   } // end switch
> > } // end while
> >
> > On the other hand, it's not that hard for a specific Continuation to not
> > trigger successive Continuations until they are actually ready to run.
> It
> > would simply need to have its own "after" queue.  We should be selective
> > about how many capabilities we pull into the core Event execution code,
> and
> > how much we damage general behavior by doing so.
> >
> >
> > >
> > > Locking mostly gets/got screwy in the plugins, because of possible
> > > thread/core jumping, the plugin affinity functions help a lot with
> that.
> > > There are patterns to prevent that, try lock, but they're a bit of a
> pain
> > > and I've seen people ignore that.
> > >
> > > Some findings we had when doing h2. Thread affinity might not always
> > yield
> > > the best latency on an almost idle system if the protocols on both
> sides
> > > are relatively heavy weight the packing and unpacking will take time
> and
> > > one can imagine there is some parallelism which could decrease latency.
> > > We've thought about controlled handover to other threads, but never
> made
> > > any work of it.
> > >
> > > I can imagine with quic coming that we might see some thread jumping
> > > again, haven't been following progress on that one.
> > >
> > > On Wed, 2 Oct 2019, 21:33 Walt Karas, <wkaras@verizonmedia.com
> .invalid>
> > > wrote:
> > >
> > >> On Wed, Oct 2, 2019 at 2:25 PM Leif Hedstrom <zw...@apache.org>
> wrote:
> > >>
> > >> >
> > >> >
> > >> > > On Oct 2, 2019, at 1:00 PM, Walt Karas <wkaras@verizonmedia.com
> > >> .INVALID>
> > >> > wrote:
> > >> > >
> > >> > > I feel like you overestimate me if you think I could create a
> > rathole
> > >> > with
> > >> > > just 30 lines of pseudo-code.
> > >> >
> > >> > We have faith in you.
> > >> >
> > >>
> > >> What's an example of a rathole I've already created?  Seems like a
> very
> > >> big
> > >> leap of faith if there isn't one.
> > >>
> > >> Question:  Why do we have more event threads than cores?
> > >>
> > >> If it's to mitigate the fact that ATS is not purist event-driven, that
> > >> Continuations sometimes do block, it makes sense to use the best
> design
> > to
> > >> mitigate this.  Sorry but I don't think any big profiling hoedown is
> > >> needed
> > >> to see that what I'm proposing mitigates it better.  It's essentially
> > the
> > >> same idea as having a single line to pay even when there are multiple
> > >> cashiers.
> > >>
> > >>
> > >> >
> > >> > >
> > >> > > We're making complex thread affinity changes, on top of an event
> > >> > execution
> > >> > > code that already seem complex.  (I leave it to those with more
> > >> rathole
> > >> > > background to say if it should be so labeled.) . Can I see the
> > >> profiling
> > >> > > results used to justify and guide the thread affinity design?
> > >> >
> > >> >
> > >> > +1.
> > >> >
> > >> > — leif
> > >> >
> > >> > >
> > >> > > On Wed, Oct 2, 2019 at 1:32 PM Leif Hedstrom <zw...@apache.org>
> > >> wrote:
> > >> > >
> > >> > >>
> > >> > >>
> > >> > >>> On Oct 2, 2019, at 11:23 AM, Walt Karas <
> wkaras@verizonmedia.com
> > >> > .INVALID>
> > >> > >> wrote:
> > >> > >>>
> > >> > >>> What's the best tool for multi-threaded profiling on Linux?
> > >> > >>
> > >> > >> Probably the Intel tools.  Probably the “perf” toolchain works
> > well,
> > >> at
> > >> > >> least on modern linux, you could do perf lock etc.
> > >> > >>
> > >> > >>>
> > >> > >>> On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll
> > >> > >>> <so...@verizonmedia.com.invalid> wrote:
> > >> > >>>
> > >> > >>>> Correct, it doesn't mean no lock contention. The claim was it
> > >> reduced
> > >> > >> the
> > >> > >>>> lock contention to a level where it's not significant enough to
> > >> > warrant
> > >> > >>>> additional preventative measures. The data Dan got wasn't from
> > code
> > >> > >>>> analysis, but from run time profiling. That was a while ago so
> if
> > >> > you'd
> > >> > >>>> like to perform another pass of measuring the level of lock
> > >> > contention,
> > >> > >>>> that would certainly be interesting data.
> > >> > >>
> > >> > >>
> > >> > >> So, before we dig ourselves into this rathole, can someone
> explain
> > >> to me
> > >> > >> what the problem is? Where are the metrics / analysis showing
> that
> > we
> > >> > have
> > >> > >> a problem? I’d love to see a comparison too between various
> version
> > >> of
> > >> > ATS,
> > >> > >> say v6 - v9, and understand where, if anywhere, we made things
> > (lock
> > >> > >> contention) worse.
> > >> > >>
> > >> > >> We have to stop making complicated and invasive solutions without
> > >> real
> > >> > >> problems, and understanding such problem.
> > >> > >>
> > >> > >> My $0.01,
> > >> > >>
> > >> > >> — Leif
> > >> > >>
> > >> > >>>>
> > >> > >>>> In addition, the push for thread affinity started from actual
> > >> issues
> > >> > in
> > >> > >>>> production with Continuations being scheduled on different
> > threads
> > >> of
> > >> > >> the
> > >> > >>>> same type (that is, it was Kees' fault). Those would not be
> > >> resolved
> > >> > by
> > >> > >>>> faster scheduling on different threads.
> > >> > >>>>
> > >> > >>>> On Tue, Oct 1, 2019 at 11:49 AM Walt Karas <
> > >> wkaras@verizonmedia.com
> > >> > >>>> .invalid>
> > >> > >>>> wrote:
> > >> > >>>>
> > >> > >>>>> I assume thread affinity can't literal mean no lock
> contention.
> > >> > You'd
> > >> > >>>> need
> > >> > >>>>> a lock on the thread run queue wouldn't you?  Continuations
> > can't
> > >> > only
> > >> > >>>> get
> > >> > >>>>> queued for the event thread from the event thread.  I don't
> > think
> > >> we
> > >> > >> can
> > >> > >>>>> say conclusively that there would be a significant difference
> > due
> > >> to
> > >> > >> lock
> > >> > >>>>> contention.  I'm guessing that Fei would agree that the
> > >> Continuation
> > >> > >>>>> dispatch code is difficult to understand and work on.
> > >> Simplification
> > >> > >> and
> > >> > >>>>> more modularity is obviously a goal.  Seems like it would be
> > >> simpler
> > >> > if
> > >> > >>>> all
> > >> > >>>>> the Continuations in a to-run list where actually ready to
> run.
> > >> > >>>>>
> > >> > >>>>> On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll
> > >> > >>>>> <so...@verizonmedia.com.invalid> wrote:
> > >> > >>>>>
> > >> > >>>>>> Do you have any more specific information on mutex
> contention?
> > We
> > >> > have
> > >> > >>>> in
> > >> > >>>>>> fact already looked at doing this, I think back in 2015 with
> > Dan
> > >> Xu.
> > >> > >>>> The
> > >> > >>>>>> goal there was to have queues with the mutexes to avoid
> > >> > rescheduling.
> > >> > >>>> As
> > >> > >>>>> a
> > >> > >>>>>> precursor Dan did some profiling and the only significant
> lock
> > >> > >>>> contention
> > >> > >>>>>> he could find was in the cache. That lead to the partial
> object
> > >> > >> caching
> > >> > >>>>>> work setting up queues for the hot locks, but it was decided
> > that
> > >> > >> given
> > >> > >>>>> the
> > >> > >>>>>> lack of
> > >> > >>>>>> contention elsewhere, it wasn't worth the complexity.
> > >> > >>>>>>
> > >> > >>>>>> I think thread affinity is a better choice because no lock
> > >> > contention
> > >> > >>>>> will
> > >> > >>>>>> always beat even the most optimized lock contention
> resolution.
> > >> If
> > >> > >>>>>> Continuations related to the same constellation of data
> objects
> > >> are
> > >> > on
> > >> > >>>>> the
> > >> > >>>>>> same thread, then the locks are never contested, which makes
> it
> > >> as
> > >> > >> fast
> > >> > >>>>> as
> > >> > >>>>>> possible.
> > >> > >>>>>>
> > >> > >>>>>> On Mon, Sep 30, 2019 at 3:45 PM Walt Karas <
> > >> wkaras@verizonmedia.com
> > >> > >>>>>> .invalid>
> > >> > >>>>>> wrote:
> > >> > >>>>>>
> > >> > >>>>>>> From the longer-term TSers I've heard comments about seeing
> > >> > profiling
> > >> > >>>>>>> results that show that waiting on mutexes is a significant
> > >> > >>>> performance
> > >> > >>>>>>> issue with TS.  But I'm not aware of any write-ups of such
> > >> results.
> > >> > >>>>>>> Unfortunately, I'm relatively new to TS and Linux, so I'm
> not
> > >> > >>>> currently
> > >> > >>>>>>> familiar with the best approaches to profiling TS.
> > >> > >>>>>>>
> > >> > >>>>>>> For better performance, I think that having a single to-run
> > >> > >>>>> Continuation
> > >> > >>>>>>> queue, or one per core, with a queue feeding multiple event
> > >> threads
> > >> > >>>> is
> > >> > >>>>>> the
> > >> > >>>>>>> main thing.  It's more resilient to Continuations that
> block.
> > >> > There
> > >> > >>>>>>> doesn't seem to be enthusiasm for getting hard-core about
> not
> > >> > having
> > >> > >>>>>>> blocking Continuations (see
> > >> > >>>>>>> https://github.com/apache/trafficserver/pull/5412 ).  I'm
> not
> > >> sure
> > >> > >>>>>>> changing
> > >> > >>>>>>> to queue-based mutexes would have a significant performance
> > >> impact.
> > >> > >>>>> But
> > >> > >>>>>> it
> > >> > >>>>>>> seems a cleaner design, making sure Continuations in the
> > to-run
> > >> > >>>> list(s)
> > >> > >>>>>> are
> > >> > >>>>>>> actually ready to run.  But a different mutex implementation
> > is
> > >> not
> > >> > >>>>>>> strictly necessary in order to consolidate to-run
> Continuation
> > >> > >>>> queues.
> > >> > >>>>>>>
> > >> > >>>>>>> On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra <
> > >> > >>>> kspoelstra@we-amp.com>
> > >> > >>>>>>> wrote:
> > >> > >>>>>>>
> > >> > >>>>>>>> Sounds very interesting.
> > >> > >>>>>>>> But what is the problem we're trying to solve here, I like
> > the
> > >> > >>>> thread
> > >> > >>>>>>>> affinity because it gives us head ache free concurrency in
> > some
> > >> > >>>>> cases,
> > >> > >>>>>>> and
> > >> > >>>>>>>> I'll bet that there is some code which doesn't have the
> > proper
> > >> > >>>>>>> continuation
> > >> > >>>>>>>> mutexes because we know it runs on the same thread.
> > >> > >>>>>>>>
> > >> > >>>>>>>> Are we seeing a lot of imbalanced threads (too much
> > processing
> > >> > >>>>> causing
> > >> > >>>>>>> long
> > >> > >>>>>>>> queues of continuations, which I can imagine in some
> cases) ?
> > >> And
> > >> > >>>>>>> shouldn't
> > >> > >>>>>>>> we balance based on transactions or connections, move those
> > >> around
> > >> > >>>>> when
> > >> > >>>>>>> we
> > >> > >>>>>>>> see imbalance and aim for embarrassingly parallel
> processing
> > :)
> > >> > >>>> Come
> > >> > >>>>>>>> to think of it, this might introduce another set of
> problems,
> > >> how
> > >> > >>>> to
> > >> > >>>>>> know
> > >> > >>>>>>>> which continuations are part of the life cycle of a
> > connection
> > >> :/
> > >> > >>>>>>>>
> > >> > >>>>>>>> Jumping threads in one transaction is not always ideal
> > either,
> > >> > this
> > >> > >>>>> can
> > >> > >>>>>>>> really hurt performance. But your proposed model seems to
> > >> handle
> > >> > >>>> that
> > >> > >>>>>>>> somewhat better than the current implementation.
> > >> > >>>>>>>>
> > >> > >>>>>>>> Very interested and wondering what this would mean for
> plugin
> > >> > >>>>>> developers.
> > >> > >>>>>>>>
> > >> > >>>>>>>> On Mon, 30 Sep 2019, 19:20 Walt Karas, <
> > >> wkaras@verizonmedia.com
> > >> > >>>>>> .invalid>
> > >> > >>>>>>>> wrote:
> > >> > >>>>>>>>
> > >> > >>>>>>>>> If a Continuation is scheduled, but its mutex is locked,
> > it's
> > >> put
> > >> > >>>>> in
> > >> > >>>>>> a
> > >> > >>>>>>>>> queue specific to that mutex.  The release function for
> the
> > >> mutex
> > >> > >>>>>>> (called
> > >> > >>>>>>>>> when a Continuation holding the mutex exists) would put
> the
> > >> > >>>>>>> Continuation
> > >> > >>>>>>>> at
> > >> > >>>>>>>>> the front of the mutex's queue (if not empty) into the
> > >> > >>>> ready-to-run
> > >> > >>>>>>> queue
> > >> > >>>>>>>>> (transferring the lock to that Continuation).  A drawback
> is
> > >> that
> > >> > >>>>> the
> > >> > >>>>>>>> queue
> > >> > >>>>>>>>> would itself need a mutex (spinlock?), but the critical
> > >> section
> > >> > >>>>> would
> > >> > >>>>>>> be
> > >> > >>>>>>>>> very short.
> > >> > >>>>>>>>>
> > >> > >>>>>>>>> There would be a function to lock a mutex directly.  It
> > would
> > >> > >>>>> create
> > >> > >>>>>> a
> > >> > >>>>>>>>> Continuation that had two condition variables.  It would
> > >> assign
> > >> > >>>> the
> > >> > >>>>>>> mutex
> > >> > >>>>>>>>> to this Continuation and schedule it.  (In this case, it
> > might
> > >> > >>>> make
> > >> > >>>>>>> sense
> > >> > >>>>>>>>> to put this Continuation at the front of the mutex's
> queue,
> > >> since
> > >> > >>>>> it
> > >> > >>>>>>>> would
> > >> > >>>>>>>>> be blocking an entire event thread.)  The direct-lock
> > function
> > >> > >>>>> would
> > >> > >>>>>>> then
> > >> > >>>>>>>>> block on the first condition variable.  When the
> > Continuation
> > >> > >>>> ran,
> > >> > >>>>> it
> > >> > >>>>>>>> would
> > >> > >>>>>>>>> trigger the first condition variable, and then block on
> the
> > >> > >>>> second
> > >> > >>>>>>>>> condition variable.  The direct-lock function would then
> > exit,
> > >> > >>>>>> allowing
> > >> > >>>>>>>> the
> > >> > >>>>>>>>> calling code to enter its critical section.  At the end of
> > the
> > >> > >>>>>> critical
> > >> > >>>>>>>>> section, another function to release the direct lock would
> > be
> > >> > >>>>> called.
> > >> > >>>>>>> It
> > >> > >>>>>>>>> would trigger the second condition variable, which would
> > cause
> > >> > >>>> the
> > >> > >>>>>>>> function
> > >> > >>>>>>>>> of the Continuation created for the direct lock to exit
> > (thus
> > >> > >>>>>> releasing
> > >> > >>>>>>>> the
> > >> > >>>>>>>>> mutex).
> > >> > >>>>>>>>>
> > >> > >>>>>>>>> With this approach, I'm not sure thread affinities would
> be
> > of
> > >> > >>>> any
> > >> > >>>>>>> value.
> > >> > >>>>>>>>> I think perhaps each core should have it's own list of
> > >> > >>>> ready-to-run
> > >> > >>>>>>>>> Continuations, and a pool of event threads with affinity
> to
> > >> that
> > >> > >>>>>> core.
> > >> > >>>>>>>> Not
> > >> > >>>>>>>>> having per-event-thread ready-to-run lists means that a
> > >> > >>>>> Continuation
> > >> > >>>>>>>>> function that blocks is less likely to block other
> > >> ready-to-run
> > >> > >>>>>>>>> Continuations.  If Continuations had core affinities to
> some
> > >> > >>>>> degree,
> > >> > >>>>>>> this
> > >> > >>>>>>>>> might reduce evictions in per-core memory cache.
> (Multiple
> > >> > >>>>>>> Continuations
> > >> > >>>>>>>>> having the same function should have the same core
> > affinity.)
> > >> > >>>>>>>>>
> > >> > >>>>>>>>
> > >> > >>>>>>>
> > >> > >>>>>>
> > >> > >>>>>
> > >> > >>>>
> > >> > >>
> > >> > >>
> > >> >
> > >> >
> > >>
> > >
> >
>

Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Alan Carroll <so...@verizonmedia.com.INVALID>.
I didn't see the thread affinity as a rathole, but more as a regularization
of existing practice. That kind of thing was already being done in many
places in an ad hoc manner (e.g. cache, DNS, VC migration). Putting it in
the Thread support classes meant there was a single implementation, not
several, with the consequent and highly desired consistency of operation.

I also considered it very desirable for plugins to be able to do similar
things, which is why I championed Kees' request on that. Any plugin doing
asynchronous operations found it difficult to impossible to do the correct
thing. Now it's easy and mostly automatic. As for tests on that, we
certainly had mysterious crashes and other problems due to the lack of such
a mechanism and even now we're still cleaning up plugins that do bad things
with scheduling Continuations. The difference now is we can put in a clean
fix rather than obscure hacks. This went through because it addressed, as
Kees noted, a well known and chronic issue in a reasonable way, formalizing
what had been unwritten rules.

Let me note again, we considered this previously, did actual measurements,
and concluded the gains were much too small to justify the code (except in
the cache, which does have hot locks). And as noted above, given that we
were already doing crude forms of thread affinity, it seemed making that
more general and consistent was the best approach to reducing lock
contention, especially as it provided other benefits.


Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Walt Karas <wk...@verizonmedia.com.INVALID>.
On Wed, Oct 2, 2019 at 3:19 PM Kees Spoelstra <ks...@we-amp.com> wrote:

> I think we're all able to dig some ratholes with less code than 30 lines.
>

Hmm, I thought "rathole" implied unneeded complexity, but it seems to be
just a synonym for "bad".


>
> Alan, it's only my fault that I pushed for threadaffinity for plugins
> after I saw how it was in core... Can't keep all the goodies for the core
> :)
>
> One thing to watch out for is in order execution of continuations, as I've
> mentioned earlier there is a big possibility that there is code which does
> not properly lock ( or does not need to) and might get executing
> continuations at the same time instead of in order when running
> concurrently.
> Sometimes to get out of the current executing context which might be deep
> in the stack (x locks down and before some cleanup) you will schedule
> another continuation on the same thread, which can give you a cleaner
> context.
> Imagine the scheduler picking that up and running it concurrently on
> another thread on the same core, that might yield weird results.
> Multiple threads might break that implicit thread safety
>

When one Continuation schedules another, it could go into a queue of
Continuations that become runnable after the current Continuation exits.
So the pseudo-code becomes:

bool idle[Num_workers_per_core];
Condition_variable go[Num_workers_per_core];
Continuation *curr[Num_workers_per_core];
Continuation_queue after[Num_workers_per_core]; // Run after current continuation.

while (true)
{
  blocking get on message queue;
  switch (msg.type)
  {
    case WORKER_THREAD_READY_FOR_NEXT:  // Sent by worker thread.
      while (after[msg.worker_index] is not empty) {
        c = pop from after[msg.worker_index];
        push c to run queue;
      }
      if (run queue empty) {
        idle[msg.worker_index] = true;
      } else {
        curr[msg.worker_index] = dequeue from run queue;
        trigger go[msg.worker_index];
      }
    break;

    case QUEUE_CONTINUATION_TO_RUN:
      run_index = index where idle[index] is true or none;
      if (run_index == none) {
        enqueue msg.continuation in run queue;
      } else {
        idle[run_index] = false;
        curr[run_index] = msg.continuation;
        trigger go[run_index];
      }
    break;

  } // end switch
} // end while

On the other hand, it's not that hard for a specific Continuation to not
trigger successive Continuations until they are actually ready to run.  It
would simply need to have its own "after" queue.  We should be selective
about how many capabilities we pull into the core Event execution code, and
how much we damage general behavior by doing so.
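
If it helps, here is a sketch of that loop in actual C++ (standard library
threading primitives; "Continuation" is just a callable here, not the ATS
class, and how entries get into the "after" queues, say via a third message
type, is left out):

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <vector>

using Continuation = std::function<void()>;

struct Gate {  // per-worker "go" signal: condition variable plus a flag
  std::mutex m;
  std::condition_variable cv;
  bool ready = false;
  void notify() { { std::lock_guard<std::mutex> lk(m); ready = true; } cv.notify_one(); }
  void wait() { std::unique_lock<std::mutex> lk(m); cv.wait(lk, [this] { return ready; }); ready = false; }
};

class CoreDispatcher {
public:
  enum MsgType { WORKER_THREAD_READY_FOR_NEXT, QUEUE_CONTINUATION_TO_RUN };
  struct Msg { MsgType type; int worker_index; Continuation cont; };

  explicit CoreDispatcher(int n_workers)
    : idle(n_workers, false), go(n_workers), curr(n_workers), after(n_workers) {}

  void post(Msg m) {  // the only entry point other threads use
    { std::lock_guard<std::mutex> lk(mq_mutex); mq.push_back(std::move(m)); }
    mq_cv.notify_one();
  }

  void worker_loop(int i) {  // body of each worker (event) thread on this core
    for (;;) {
      post({WORKER_THREAD_READY_FOR_NEXT, i, nullptr});
      go[i].wait();  // dispatcher stores curr[i] before notifying
      curr[i]();     // may block without stalling the core's other workers
    }
  }

  void dispatcher_loop() {  // the single dispatcher thread for this core
    for (;;) {
      Msg msg;
      {  // blocking get on the message queue
        std::unique_lock<std::mutex> lk(mq_mutex);
        mq_cv.wait(lk, [this] { return !mq.empty(); });
        msg = std::move(mq.front());
        mq.pop_front();
      }
      if (msg.type == WORKER_THREAD_READY_FOR_NEXT) {  // sent by a worker thread
        auto &aq = after[msg.worker_index];
        for (; !aq.empty(); aq.pop_front())            // work deferred by the continuation
          run_queue.push_back(std::move(aq.front()));  // that just finished becomes runnable
        if (run_queue.empty()) {
          idle[msg.worker_index] = true;
        } else {
          curr[msg.worker_index] = std::move(run_queue.front());
          run_queue.pop_front();
          go[msg.worker_index].notify();
        }
      } else {  // QUEUE_CONTINUATION_TO_RUN
        int run_index = -1;
        for (int i = 0; i < int(idle.size()); ++i)
          if (idle[i]) { run_index = i; break; }
        if (run_index < 0) {
          run_queue.push_back(std::move(msg.cont));
        } else {
          idle[run_index] = false;
          curr[run_index] = std::move(msg.cont);
          go[run_index].notify();
        }
      }
    }
  }

private:
  std::mutex mq_mutex;
  std::condition_variable mq_cv;
  std::deque<Msg> mq;                          // dispatcher's message queue
  std::deque<Continuation> run_queue;          // ready-to-run, shared per core
  std::vector<bool> idle;                      // touched only by the dispatcher
  std::vector<Gate> go;                        // dispatcher-to-worker handoff
  std::vector<Continuation> curr;              // written by dispatcher, run by worker i
  std::vector<std::deque<Continuation>> after; // touched only by the dispatcher
};

Note that idle, curr, after, and the run queue are touched only by the
dispatcher thread, so the only contended lock is the short-lived one on the
message queue.  Spin up one dispatcher thread plus Num_workers_per_core
worker threads per core (pinned with pthread_setaffinity_np() or similar)
and that is the per-core pool described earlier.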


>
> Locking mostly gets/got screwy in the plugins, because of possible
> thread/core jumping, the plugin affinity functions help a lot with that.
> There are patterns to prevent that, try lock, but they're a bit of a pain
> and I've seen people ignore that.
>
> Some findings we had when doing h2. Thread affinity might not always yield
> the best latency on an almost idle system if the protocols on both sides
> are relatively heavy weight the packing and unpacking will take time and
> one can imagine there is some parallelism which could decrease latency.
> We've thought about controlled handover to other threads, but never made
> any work of it.
>
> I can imagine with quic coming that we might see some thread jumping
> again, haven't been following progress on that one.
>

Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Kees Spoelstra <ks...@we-amp.com>.
I think we're all able to dig some ratholes with less code than 30 lines.

Alan, it's only my fault that I pushed for thread affinity for plugins after
I saw how it was in core... Can't keep all the goodies for the core :)

One thing to watch out for is in-order execution of continuations. As I've
mentioned earlier, there is a big possibility that there is code which does
not lock properly (or does not need to) and might end up executing
continuations at the same time, instead of in order, when they run
concurrently.
Sometimes, to get out of the current execution context, which might be deep
in the stack (several locks down and before some cleanup), you will schedule
another continuation on the same thread, which can give you a cleaner
context.
Imagine the scheduler picking that up and running it concurrently on
another thread on the same core; that might yield weird results. Multiple
threads might break that implicit thread safety.

Locking mostly gets/got screwy in the plugins because of possible
thread/core jumping; the plugin affinity functions help a lot with that.
There are patterns to prevent it, such as try-lock, but they're a bit of a
pain and I've seen people ignore them.
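
The pattern, roughly, is below; this is from memory, so treat the exact
scheduling call and signatures as approximate and check ts/ts.h for your TS
version.

static TSMutex shared_state_mutex;  // guards state shared across continuations

static int
handler(TSCont cont, TSEvent event, void *edata)
{
  if (TSMutexLockTry(shared_state_mutex) != TS_SUCCESS) {
    // Don't block the event thread; reschedule and try again shortly.
    TSContScheduleOnPool(cont, 1 /* ms */, TS_THREAD_POOL_NET);
    return 0;
  }
  // ... critical section on the shared state ...
  TSMutexUnlock(shared_state_mutex);
  return 0;
}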

Some findings we had when doing h2: thread affinity might not always yield
the best latency on an almost idle system. If the protocols on both sides
are relatively heavyweight, the packing and unpacking will take time, and
one can imagine there is some parallelism which could decrease latency.
We've thought about controlled handover to other threads, but never made
any work of it.

I can imagine that with QUIC coming we might see some thread jumping again;
I haven't been following progress on that one.


Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Walt Karas <wk...@verizonmedia.com.INVALID>.
On Wed, Oct 2, 2019 at 2:25 PM Leif Hedstrom <zw...@apache.org> wrote:

>
>
> > On Oct 2, 2019, at 1:00 PM, Walt Karas <wk...@verizonmedia.com.INVALID>
> wrote:
> >
> > I feel like you overestimate me if you think I could create a rathole
> with
> > just 30 lines of pseudo-code.
>
> We have faith in you.
>

What's an example of a rathole I've already created?  Seems like a very big
leap of faith if there isn't one.

Question:  Why do we have more event threads than cores?

If it's to mitigate the fact that ATS is not purely event-driven, that
Continuations sometimes do block, then it makes sense to use the best design
to mitigate this.  Sorry, but I don't think any big profiling hoedown is
needed to see that what I'm proposing mitigates it better.  It's essentially
the same idea as having a single line to pay even when there are multiple
cashiers.


>
> >
> > We're making complex thread affinity changes, on top of an event
> execution
> > code that already seem complex.  (I leave it to those with more rathole
> > background to say if it should be so labeled.) . Can I see the profiling
> > results used to justify and guide the thread affinity design?
>
>
> +1.
>
> — leif
>
> >
> > On Wed, Oct 2, 2019 at 1:32 PM Leif Hedstrom <zw...@apache.org> wrote:
> >
> >>
> >>
> >>> On Oct 2, 2019, at 11:23 AM, Walt Karas <wkaras@verizonmedia.com
> .INVALID>
> >> wrote:
> >>>
> >>> What's the best tool for multi-threaded profiling on Linux?
> >>
> >> Probably the Intel tools.  Probably the “perf” toolchain works well, at
> >> least on modern linux, you could do perf lock etc.
> >>
> >>>
> >>> On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll
> >>> <so...@verizonmedia.com.invalid> wrote:
> >>>
> >>>> Correct, it doesn't mean no lock contention. The claim was it reduced
> >> the
> >>>> lock contention to a level where it's not significant enough to
> warrant
> >>>> additional preventative measures. The data Dan got wasn't from code
> >>>> analysis, but from run time profiling. That was a while ago so if
> you'd
> >>>> like to perform another pass of measuring the level of lock
> contention,
> >>>> that would certainly be interesting data.
> >>
> >>
> >> So, before we dig outselves into this rathole, can someone explain to me
> >> what the problem is? Where are the metrics / analysis showing that we
> have
> >> a problem? I’d love to see a comparison too between various version of
> ATS,
> >> say v6 - v9, and understand where, if anywhere, we made things (lock
> >> contention) worse.
> >>
> >> We have to stop making complicated and invasive solutions without real
> >> problems, and understanding such problem.
> >>
> >> My $0.01,
> >>
> >> — Leif
> >>
> >>>>
> >>>> In addition, the push for thread affinity started from actual issues
> in
> >>>> production with Continuations being scheduled on different threads of
> >> the
> >>>> same type (that is, it was Kees' fault). Those would not be resolved
> by
> >>>> faster scheduling on different threads.
> >>>>
> >>>> On Tue, Oct 1, 2019 at 11:49 AM Walt Karas <wkaras@verizonmedia.com
> >>>> .invalid>
> >>>> wrote:
> >>>>
> >>>>> I assume thread affinity can't literal mean no lock contention.
> You'd
> >>>> need
> >>>>> a lock on the thread run queue wouldn't you?  Continuations can't
> only
> >>>> get
> >>>>> queued for the event thread from the event thread.  I don't think we
> >> can
> >>>>> say conclusively that there would be a significant difference due to
> >> lock
> >>>>> contention.  I'm guessing that Fei would agree that the Continuation
> >>>>> dispatch code is difficult to understand and work on.  Simplification
> >> and
> >>>>> more modularity is obviously a goal.  Seems like it would be simpler
> if
> >>>> all
> >>>>> the Continuations in a to-run list where actually ready to run.
> >>>>>
> >>>>> On Tue, Oct 1, 2019 at 9:22 AM Alan Carroll
> >>>>> <so...@verizonmedia.com.invalid> wrote:
> >>>>>
> >>>>>> Do you have any more specific information on mutex contention? We
> have
> >>>> in
> >>>>>> fact already looked at doing this, I think back in 2015 with Dan Xu.
> >>>> The
> >>>>>> goal there was to have queues with the mutexes to avoid
> rescheduling.
> >>>> As
> >>>>> a
> >>>>>> precursor Dan did some profiling and the only significant lock
> >>>> contention
> >>>>>> he could find was in the cache. That lead to the partial object
> >> caching
> >>>>>> work setting up queues for the hot locks, but it was decided that
> >> given
> >>>>> the
> >>>>>> lack of
> >>>>>> contention elsewhere, it wasn't worth the complexity.
> >>>>>>
> >>>>>> I think thread affinity is a better choice because no lock
> contention
> >>>>> will
> >>>>>> always beat even the most optimized lock contention resolution. If
> >>>>>> Continuations related to the same constellation of data objects are
> on
> >>>>> the
> >>>>>> same thread, then the locks are never contested, which makes it as
> >> fast
> >>>>> as
> >>>>>> possible.
> >>>>>>
> >>>>>> On Mon, Sep 30, 2019 at 3:45 PM Walt Karas <wkaras@verizonmedia.com
> >>>>>> .invalid>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> From the longer-term TSers I've heard comments about seeing
> profiling
> >>>>>>> results that show that waiting on mutexes is a significant
> >>>> performance
> >>>>>>> issue with TS.  But I'm not aware of any write-ups of such results.
> >>>>>>> Unfortunately, I'm relatively new to TS and Linux, so I'm not
> >>>> currently
> >>>>>>> familiar with the best approaches to profiling TS.
> >>>>>>>
> >>>>>>> For better performance, I think that having a single to-run
> >>>>> Continuation
> >>>>>>> queue, or one per core, with a queue feeding multiple event threads
> >>>> is
> >>>>>> the
> >>>>>>> main thing.  It's more resilient to Continuations that block.
> There
> >>>>>>> doesn't seem to be enthusiasm for getting hard-core about not
> having
> >>>>>>> blocking Continuations (see
> >>>>>>> https://github.com/apache/trafficserver/pull/5412 ).  I'm not sure
> >>>>>>> changing
> >>>>>>> to queue-based mutexes would have a significant performance impact.
> >>>>> But
> >>>>>> it
> >>>>>>> seems a cleaner design, making sure Continuations in the to-run
> >>>> list(s)
> >>>>>> are
> >>>>>>> actually ready to run.  But a different mutex implementation is not
> >>>>>>> strictly necessary in order to consolidate to-run Continuation
> >>>> queues.
> >>>>>>>
> >>>>>>> On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra <
> >>>> kspoelstra@we-amp.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Sounds very interesting.
> >>>>>>>> But what is the problem we're trying to solve here, I like the
> >>>> thread
> >>>>>>>> affinity because it gives us head ache free concurrency in some
> >>>>> cases,
> >>>>>>> and
> >>>>>>>> I'll bet that there is some code which doesn't have the proper
> >>>>>>> continuation
> >>>>>>>> mutexes because we know it runs on the same thread.
> >>>>>>>>
> >>>>>>>> Are we seeing a lot of imbalanced threads (too much processing
> >>>>> causing
> >>>>>>> long
> >>>>>>>> queues of continuations, which I can imagine in some cases) ? And
> >>>>>>> shouldn't
> >>>>>>>> we balance based on transactions or connections, move those around
> >>>>> when
> >>>>>>> we
> >>>>>>>> see imbalance and aim for embarrassingly parallel processing :)
> >>>> Come
> >>>>>>>> to think of it, this might introduce another set of problems, how
> >>>> to
> >>>>>> know
> >>>>>>>> which continuations are part of the life cycle of a connection :/
> >>>>>>>>
> >>>>>>>> Jumping threads in one transaction is not always ideal either,
> this
> >>>>> can
> >>>>>>>> really hurt performance. But your proposed model seems to handle
> >>>> that
> >>>>>>>> somewhat better than the current implementation.
> >>>>>>>>
> >>>>>>>> Very interested and wondering what this would mean for plugin
> >>>>>> developers.
> >>>>>>>>
> >>>>>>>> On Mon, 30 Sep 2019, 19:20 Walt Karas, <wkaras@verizonmedia.com
> >>>>>> .invalid>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> If a Continuation is scheduled, but its mutex is locked, it's put
> >>>>> in
> >>>>>> a
> >>>>>>>>> queue specific to that mutex.  The release function for the mutex
> >>>>>>> (called
> >>>>>>>>> when a Continuation holding the mutex exists) would put the
> >>>>>>> Continuation
> >>>>>>>> at
> >>>>>>>>> the front of the mutex's queue (if not empty) into the
> >>>> ready-to-run
> >>>>>>> queue
> >>>>>>>>> (transferring the lock to that Continuation).  A drawback is that
> >>>>> the
> >>>>>>>> queue
> >>>>>>>>> would itself need a mutex (spinlock?), but the critical section
> >>>>> would
> >>>>>>> be
> >>>>>>>>> very short.
> >>>>>>>>>
> >>>>>>>>> There would be a function to lock a mutex directly.  It would
> >>>>> create
> >>>>>> a
> >>>>>>>>> Continuation that had two condition variables.  It would assign
> >>>> the
> >>>>>>> mutex
> >>>>>>>>> to this Continuation and schedule it.  (In this case, it might
> >>>> make
> >>>>>>> sense
> >>>>>>>>> to put this Continuation at the front of the mutex's queue, since
> >>>>> it
> >>>>>>>> would
> >>>>>>>>> be blocking an entire event thread.)  The direct-lock function
> >>>>> would
> >>>>>>> then
> >>>>>>>>> block on the first condition variable.  When the Continuation
> >>>> ran,
> >>>>> it
> >>>>>>>> would
> >>>>>>>>> trigger the first condition variable, and then block on the
> >>>> second
> >>>>>>>>> condition variable.  The direct-lock function would then exit,
> >>>>>> allowing
> >>>>>>>> the
> >>>>>>>>> calling code to enter its critical section.  At the end of the
> >>>>>> critical
> >>>>>>>>> section, another function to release the direct lock would be
> >>>>> called.
> >>>>>>> It
> >>>>>>>>> would trigger the second condition variable, which would cause
> >>>> the
> >>>>>>>> function
> >>>>>>>>> of the Continuation created for the direct lock to exit (thus
> >>>>>> releasing
> >>>>>>>> the
> >>>>>>>>> mutex).
> >>>>>>>>>
> >>>>>>>>> With this approach, I'm not sure thread affinities would be of
> >>>> any
> >>>>>>> value.
> >>>>>>>>> I think perhaps each core should have it's own list of
> >>>> ready-to-run
> >>>>>>>>> Continuations, and a pool of event threads with affinity to that
> >>>>>> core.
> >>>>>>>> Not
> >>>>>>>>> having per-event-thread ready-to-run lists means that a
> >>>>> Continuation
> >>>>>>>>> function that blocks is less likely to block other ready-to-run
> >>>>>>>>> Continuations.  If Continuations had core affinities to some
> >>>>> degree,
> >>>>>>> this
> >>>>>>>>> might reduce evictions in per-core memory cache.  (Multiple
> >>>>>>> Continuations
> >>>>>>>>> having the same function should have the same core affinity.)
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>
> >>
>
>

Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Leif Hedstrom <zw...@apache.org>.

> On Oct 2, 2019, at 1:00 PM, Walt Karas <wk...@verizonmedia.com.INVALID> wrote:
> 
> I feel like you overestimate me if you think I could create a rathole with
> just 30 lines of pseudo-code.

We have faith in you.

> 
> We're making complex thread affinity changes, on top of an event execution
> code that already seem complex.  (I leave it to those with more rathole
> background to say if it should be so labeled.) . Can I see the profiling
> results used to justify and guide the thread affinity design?


+1.

— leif



Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Walt Karas <wk...@verizonmedia.com.INVALID>.
I feel like you overestimate me if you think I could create a rathole with
just 30 lines of pseudo-code.

We're making complex thread affinity changes, on top of event execution
code that already seems complex.  (I leave it to those with more rathole
background to say if it should be so labeled.)  Can I see the profiling
results used to justify and guide the thread affinity design?


Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Fei Deng <du...@apache.org>.
I've done some profiling before using Intel's VTune; you can ask Jason for
details.


Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Leif Hedstrom <zw...@apache.org>.

> On Oct 2, 2019, at 11:23 AM, Walt Karas <wk...@verizonmedia.com.INVALID> wrote:
> 
> What's the best tool for multi-threaded profiling on Linux?

Probably the Intel tools.  The “perf” toolchain probably works well too; at least on modern Linux, you could do perf lock etc.

> 
> On Wed, Oct 2, 2019 at 10:14 AM Alan Carroll
> <so...@verizonmedia.com.invalid> wrote:
> 
>> Correct, it doesn't mean no lock contention. The claim was it reduced the
>> lock contention to a level where it's not significant enough to warrant
>> additional preventative measures. The data Dan got wasn't from code
>> analysis, but from run time profiling. That was a while ago so if you'd
>> like to perform another pass of measuring the level of lock contention,
>> that would certainly be interesting data.


So, before we dig ourselves into this rathole, can someone explain to me what the problem is? Where are the metrics / analysis showing that we have a problem? I’d love to see a comparison too between various versions of ATS, say v6 - v9, and understand where, if anywhere, we made things (lock contention) worse.

We have to stop building complicated and invasive solutions without real problems, and without understanding such problems.

My $0.01,

— Leif



Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Walt Karas <wk...@verizonmedia.com.INVALID>.
What's the best tool for multi-threaded profiling on Linux?


Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Alan Carroll <so...@verizonmedia.com.INVALID>.
Correct, it doesn't mean no lock contention. The claim was it reduced the
lock contention to a level where it's not significant enough to warrant
additional preventative measures. The data Dan got wasn't from code
analysis, but from run time profiling. That was a while ago so if you'd
like to perform another pass of measuring the level of lock contention,
that would certainly be interesting data.

In addition, the push for thread affinity started from actual issues in
production with Continuations being scheduled on different threads of the
same type (that is, it was Kees' fault). Those would not be resolved by
faster scheduling on different threads.
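
If someone does want to take that measurement again, here is a minimal
sketch of the kind of drop-in instrumentation that could produce it. This
is plain C++ with invented names (CountingMutex and friends), not ATS's
actual lock classes; it just counts how often the uncontended fast path
fails and how long the slow path ends up waiting.

#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical instrumented mutex: counts contended acquisitions and the
// total time spent waiting, so hot locks can be spotted at run time.
class CountingMutex {
public:
  void lock() {
    if (!m_.try_lock()) {              // fast path failed -> contended
      contended_.fetch_add(1, std::memory_order_relaxed);
      auto t0 = std::chrono::steady_clock::now();
      m_.lock();
      auto t1 = std::chrono::steady_clock::now();
      wait_ns_.fetch_add(
          std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count(),
          std::memory_order_relaxed);
    }
    acquired_.fetch_add(1, std::memory_order_relaxed);
  }
  void unlock() { m_.unlock(); }

  void report() const {
    std::printf("acquired=%llu contended=%llu wait_ms=%.3f\n",
                (unsigned long long)acquired_.load(),
                (unsigned long long)contended_.load(),
                wait_ns_.load() / 1e6);
  }

private:
  std::mutex m_;
  std::atomic<unsigned long long> acquired_{0}, contended_{0}, wait_ns_{0};
};

int main() {
  CountingMutex mtx;
  long counter = 0;
  std::vector<std::thread> workers;
  for (int i = 0; i < 4; ++i) {
    workers.emplace_back([&] {
      for (int j = 0; j < 100000; ++j) {
        std::lock_guard<CountingMutex> g(mtx);
        ++counter;                     // trivial critical section
      }
    });
  }
  for (auto &w : workers) w.join();
  mtx.report();
  return 0;
}

Pointing a wrapper like this (or perf lock / VTune) at the hottest lock
call sites would show whether contention outside the cache has grown since
that earlier pass.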


Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Walt Karas <wk...@verizonmedia.com.INVALID>.
I assume thread affinity can't literally mean no lock contention.  You'd need
a lock on the thread run queue, wouldn't you?  Continuations can't only get
queued for the event thread from the event thread.  I don't think we can
say conclusively that there would be a significant difference due to lock
contention.  I'm guessing that Fei would agree that the Continuation
dispatch code is difficult to understand and work on.  Simplification and
more modularity are obviously goals.  Seems like it would be simpler if all
the Continuations in a to-run list were actually ready to run.
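
To make the run-queue point concrete, here is a minimal sketch -- ordinary
C++ with made-up names (RunQueue, enqueue, drain), not the real EThread
code -- of why a per-thread ready-to-run list still needs its own short
lock once other threads are allowed to schedule onto it.

#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical per-thread ready-to-run list. Because continuations can be
// scheduled onto this thread from *other* threads, pushes and pops must be
// serialized; a short mutex (or spinlock) around the queue is unavoidable.
class RunQueue {
public:
  void enqueue(std::function<void()> cont) {     // callable from any thread
    std::lock_guard<std::mutex> g(lock_);
    ready_.push_back(std::move(cont));
  }

  // Called only by the owning event thread: drain whatever is ready.
  void drain() {
    std::deque<std::function<void()>> batch;
    {
      std::lock_guard<std::mutex> g(lock_);      // critical section is tiny:
      batch.swap(ready_);                        // just swap the queue out
    }
    for (auto &cont : batch) {
      cont();                                    // run outside the queue lock
    }
  }

private:
  std::mutex lock_;
  std::deque<std::function<void()>> ready_;
};

int main() {
  RunQueue q;
  std::thread producer([&] {
    for (int i = 0; i < 3; ++i)
      q.enqueue([i] { std::printf("continuation %d ran\n", i); });
  });
  producer.join();
  q.drain();   // the "event thread" runs everything that became ready
  return 0;
}

The critical section is just a queue swap, so it's cheap, but it isn't zero.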


Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Alan Carroll <so...@verizonmedia.com.INVALID>.
Do you have any more specific information on mutex contention? We have in
fact already looked at doing this, I think back in 2015 with Dan Xu. The
goal there was to have queues with the mutexes to avoid rescheduling. As a
precursor, Dan did some profiling, and the only significant lock contention
he could find was in the cache. That led to the partial object caching work
setting up queues for the hot locks, but it was decided that, given the lack
of contention elsewhere, it wasn't worth the complexity.

I think thread affinity is a better choice because no lock contention will
always beat even the most optimized lock contention resolution. If
Continuations related to the same constellation of data objects are on the
same thread, then the locks are never contested, which makes it as fast as
possible.
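
To make that concrete, here is a rough sketch of the idea (not the actual
ProxyMutex code; EventThread and this_event_thread() are made-up stand-ins).
The fast path is re-acquisition by the thread that already holds the lock, so
if every Continuation sharing a mutex is pinned to one event thread,
try_lock() never takes the contested path:

#include <atomic>
#include <cassert>

// Hypothetical stand-in for an ATS event thread; one instance per OS thread.
struct EventThread {};

EventThread *this_event_thread()
{
  thread_local EventThread et;
  return &et;
}

// A lock that is re-entrant for the thread already holding it.  When every
// Continuation sharing a mutex runs on the same event thread, try_lock()
// always takes one of the two uncontested paths and never fails.
class AffinityMutex
{
public:
  bool try_lock()
  {
    EventThread *self = this_event_thread();
    if (holder_.load(std::memory_order_acquire) == self) {
      ++depth_; // re-entry from the owning thread: no contention at all
      return true;
    }
    EventThread *expected = nullptr;
    if (holder_.compare_exchange_strong(expected, self, std::memory_order_acquire)) {
      depth_ = 1; // lock was free
      return true;
    }
    return false; // contested: the caller would have to reschedule
  }

  void unlock()
  {
    assert(holder_.load(std::memory_order_relaxed) == this_event_thread());
    if (--depth_ == 0) {
      holder_.store(nullptr, std::memory_order_release);
    }
  }

private:
  std::atomic<EventThread *> holder_{nullptr};
  int depth_ = 0; // only touched by the holding thread
};

The queue-based proposal optimizes what happens after that contested
return-false path; affinity aims to make the contested case never happen in
the first place.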

On Mon, Sep 30, 2019 at 3:45 PM Walt Karas <wk...@verizonmedia.com.invalid>
wrote:

> From the longer-term TSers I've heard comments about seeing profiling
> results that show that waiting on mutexes is a significant performance
> issue with TS.  But I'm not aware of any write-ups of such results.
> Unfortunately, I'm relatively new to TS and Linux, so I'm not currently
> familiar with the best approaches to profiling TS.
>
> For better performance, I think that having a single to-run Continuation
> queue, or one per core, with a queue feeding multiple event threads is the
> main thing.  It's more resilient to Continuations that block.  There
> doesn't seem to be enthusiasm for getting hard-core about not having
> blocking Continuations (see
> https://github.com/apache/trafficserver/pull/5412 ).  I'm not sure
> changing
> to queue-based mutexes would have a significant performance impact.  But it
> seems a cleaner design, making sure Continuations in the to-run list(s) are
> actually ready to run.  But a different mutex implementation is not
> strictly necessary in order to consolidate to-run Continuation queues.
>
> On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra <ks...@we-amp.com>
> wrote:
>
> > Sounds very interesting.
> > But what is the problem we're trying to solve here, I like the thread
> > affinity because it gives us headache-free concurrency in some cases,
> and
> > I'll bet that there is some code which doesn't have the proper
> continuation
> > mutexes because we know it runs on the same thread.
> >
> > Are we seeing a lot of imbalanced threads (too much processing causing
> long
> > queues of continuations, which I can imagine in some cases) ? And
> shouldn't
> > we balance based on transactions or connections, move those around when
> we
> > see imbalance and aim for embarrassingly parallel processing :) Come
> > to think of it, this might introduce another set of problems, how to know
> > which continuations are part of the life cycle of a connection :/
> >
> > Jumping threads in one transaction is not always ideal either, this can
> > really hurt performance. But your proposed model seems to handle that
> > somewhat better than the current implementation.
> >
> > Very interested and wondering what this would mean for plugin developers.
> >
> > On Mon, 30 Sep 2019, 19:20 Walt Karas, <wk...@verizonmedia.com.invalid>
> > wrote:
> >
> > > If a Continuation is scheduled, but its mutex is locked, it's put in a
> > > queue specific to that mutex.  The release function for the mutex
> (called
> > > when a Continuation holding the mutex exits) would put the
> Continuation
> > at
> > > the front of the mutex's queue (if not empty) into the ready-to-run
> queue
> > > (transferring the lock to that Continuation).  A drawback is that the
> > queue
> > > would itself need a mutex (spinlock?), but the critical section would
> be
> > > very short.
> > >
> > > There would be a function to lock a mutex directly.  It would create a
> > > Continuation that had two condition variables.  It would assign the
> mutex
> > > to this Continuation and schedule it.  (In this case, it might make
> sense
> > > to put this Continuation at the front of the mutex's queue, since it
> > would
> > > be blocking an entire event thread.)  The direct-lock function would
> then
> > > block on the first condition variable.  When the Continuation ran, it
> > would
> > > trigger the first condition variable, and then block on the second
> > > condition variable.  The direct-lock function would then exit, allowing
> > the
> > > calling code to enter its critical section.  At the end of the critical
> > > section, another function to release the direct lock would be called.
> It
> > > would trigger the second condition variable, which would cause the
> > function
> > > of the Continuation created for the direct lock to exit (thus releasing
> > the
> > > mutex).
> > >
> > > With this approach, I'm not sure thread affinities would be of any
> value.
> > > I think perhaps each core should have its own list of ready-to-run
> > > Continuations, and a pool of event threads with affinity to that core.
> > Not
> > > having per-event-thread ready-to-run lists means that a Continuation
> > > function that blocks is less likely to block other ready-to-run
> > > Continuations.  If Continuations had core affinities to some degree,
> this
> > > might reduce evictions in per-core memory cache.  (Multiple
> Continuations
> > > having the same function should have the same core affinity.)
> > >
> >
>

Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Walt Karas <wk...@verizonmedia.com.INVALID>.
I've heard comments from longer-term TSers about profiling results showing
that waiting on mutexes is a significant performance issue in TS.  But I'm
not aware of any write-ups of such results.  Unfortunately, I'm relatively
new to TS and Linux, so I'm not currently familiar with the best approaches
to profiling TS.

For better performance, I think the main thing is having a single to-run
Continuation queue, or one per core, with each queue feeding multiple event
threads.  That arrangement is more resilient to Continuations that block.
There doesn't seem to be enthusiasm for getting hard-core about not having
blocking Continuations (see
https://github.com/apache/trafficserver/pull/5412 ).  I'm not sure changing
to queue-based mutexes would have a significant performance impact, but it
seems a cleaner design, making sure Continuations in the to-run list(s) are
actually ready to run.  In any case, a different mutex implementation is not
strictly necessary in order to consolidate the to-run Continuation queues.
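
As a rough illustration (the names are hypothetical; this is not the
existing EventProcessor), a single per-core ready-to-run queue feeding a
pool of event threads could look like this.  Because every thread in the
pool pulls from the same queue, a handler that blocks stalls only the thread
running it:

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

using Continuation = std::function<void()>; // stand-in for a schedulable unit

class CoreRunQueue
{
public:
  // One ready-to-run queue shared by 'nthreads' event threads; in the
  // proposal both the queue and its threads would have affinity to a single
  // core.  The queue is assumed to live for the life of the process.
  explicit CoreRunQueue(unsigned nthreads)
  {
    for (unsigned i = 0; i < nthreads; ++i) {
      std::thread([this] { worker(); }).detach();
    }
  }

  void schedule(Continuation c)
  {
    {
      std::lock_guard<std::mutex> g(m_);
      q_.push_back(std::move(c));
    }
    cv_.notify_one();
  }

private:
  void worker()
  {
    for (;;) {
      Continuation c;
      {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        c = std::move(q_.front());
        q_.pop_front();
      }
      c(); // if this handler blocks, the other threads keep draining the queue
    }
  }

  std::mutex m_;
  std::condition_variable cv_;
  std::deque<Continuation> q_;
};

A real version would also check or hand off the Continuation's mutex just
before c() runs; the only point here is that the run queue is shared per
core rather than per event thread.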

On Mon, Sep 30, 2019 at 2:39 PM Kees Spoelstra <ks...@we-amp.com>
wrote:

> Sounds very interesting.
> But what is the problem we're trying to solve here, I like the thread
> affinity because it gives us headache-free concurrency in some cases, and
> I'll bet that there is some code which doesn't have the proper continuation
> mutexes because we know it runs on the same thread.
>
> Are we seeing a lot of imbalanced threads (too much processing causing long
> queues of continuations, which I can imagine in some cases) ? And shouldn't
> we balance based on transactions or connections, move those around when we
> see imbalance and aim for embarrassingly parallel processing :) Come
> to think of it, this might introduce another set of problems, how to know
> which continuations are part of the life cycle of a connection :/
>
> Jumping threads in one transaction is not always ideal either, this can
> really hurt performance. But your proposed model seems to handle that
> somewhat better than the current implementation.
>
> Very interested and wondering what this would mean for plugin developers.
>
> On Mon, 30 Sep 2019, 19:20 Walt Karas, <wk...@verizonmedia.com.invalid>
> wrote:
>
> > If a Continuation is scheduled, but its mutex is locked, it's put in a
> > queue specific to that mutex.  The release function for the mutex (called
> > when a Continuation holding the mutex exits) would put the Continuation
> at
> > the front of the mutex's queue (if not empty) into the ready-to-run queue
> > (transferring the lock to that Continuation).  A drawback is that the
> queue
> > would itself need a mutex (spinlock?), but the critical section would be
> > very short.
> >
> > There would be a function to lock a mutex directly.  It would create a
> > Continuation that had two condition variables.  It would assign the mutex
> > to this Continuation and schedule it.  (In this case, it might make sense
> > to put this Continuation at the front of the mutex's queue, since it
> would
> > be blocking an entire event thread.)  The direct-lock function would then
> > block on the first condition variable.  When the Continuation ran, it
> would
> > trigger the first condition variable, and then block on the second
> > condition variable.  The direct-lock function would then exit, allowing
> the
> > calling code to enter its critical section.  At the end of the critical
> > section, another function to release the direct lock would be called.  It
> > would trigger the second condition variable, which would cause the
> function
> > of the Continuation created for the direct lock to exit (thus releasing
> the
> > mutex).
> >
> > With this approach, I'm not sure thread affinities would be of any value.
> > I think perhaps each core should have its own list of ready-to-run
> > Continuations, and a pool of event threads with affinity to that core.
> Not
> > having per-event-thread ready-to-run lists means that a Continuation
> > function that blocks is less likely to block other ready-to-run
> > Continuations.  If Continuations had core affinities to some degree, this
> > might reduce evictions in per-core memory cache.  (Multiple Continuations
> > having the same function should have the same core affinity.)
> >
>

Re: Tentative proposal: TS mutexes should be queues not OS mutexes

Posted by Kees Spoelstra <ks...@we-amp.com>.
Sounds very interesting.
But what is the problem we're trying to solve here? I like the thread
affinity because it gives us headache-free concurrency in some cases, and
I'll bet that there is some code which doesn't have the proper continuation
mutexes because we know it runs on the same thread.

Are we seeing a lot of imbalanced threads (too much processing causing long
queues of continuations, which I can imagine in some cases)? And shouldn't
we balance based on transactions or connections, moving those around when we
see imbalance and aiming for embarrassingly parallel processing? :) Come
to think of it, this might introduce another set of problems: how to know
which continuations are part of the life cycle of a connection :/

Jumping threads in one transaction is not always ideal either; it can
really hurt performance. But your proposed model seems to handle that
somewhat better than the current implementation.

Very interested and wondering what this would mean for plugin developers.

On Mon, 30 Sep 2019, 19:20 Walt Karas, <wk...@verizonmedia.com.invalid>
wrote:

> If a Continuation is scheduled, but its mutex is locked, it's put in a
> queue specific to that mutex.  The release function for the mutex (called
> when a Continuation holding the mutex exits) would put the Continuation at
> the front of the mutex's queue (if not empty) into the ready-to-run queue
> (transferring the lock to that Continuation).  A drawback is that the queue
> would itself need a mutex (spinlock?), but the critical section would be
> very short.
>
> There would be a function to lock a mutex directly.  It would create a
> Continuation that had two condition variables.  It would assign the mutex
> to this Continuation and schedule it.  (In this case, it might make sense
> to put this Continuation at the front of the mutex's queue, since it would
> be blocking an entire event thread.)  The direct-lock function would then
> block on the first condition variable.  When the Continuation ran, it would
> trigger the first condition variable, and then block on the second
> condition variable.  The direct-lock function would then exit, allowing the
> calling code to enter its critical section.  At the end of the critical
> section, another function to release the direct lock would be called.  It
> would trigger the second condition variable, which would cause the function
> of the Continuation created for the direct lock to exit (thus releasing the
> mutex).
>
> With this approach, I'm not sure thread affinities would be of any value.
> I think perhaps each core should have its own list of ready-to-run
> Continuations, and a pool of event threads with affinity to that core.  Not
> having per-event-thread ready-to-run lists means that a Continuation
> function that blocks is less likely to block other ready-to-run
> Continuations.  If Continuations had core affinities to some degree, this
> might reduce evictions in per-core memory cache.  (Multiple Continuations
> having the same function should have the same core affinity.)
>
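
To make the proposal quoted above concrete, here is a minimal single-file
sketch.  Everything in it is hypothetical (QueueMutex, Continuation,
schedule, direct_lock, and direct_unlock are invented names, and the
scheduling stand-ins just spawn detached threads instead of using real event
threads); it only illustrates the two mechanisms: release() handing the lock
straight to the front waiter, and a direct lock built from a helper
Continuation plus two condition variables:

#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

class QueueMutex;

struct Continuation {
  std::function<void()> handler;
  QueueMutex *mutex = nullptr; // the mutex protecting this Continuation's state
};

// The proposed mutex: blocked Continuations are parked on a per-mutex queue,
// and release() hands the lock to the front waiter instead of dropping it.
class QueueMutex
{
public:
  // Returns true if 'c' now holds the lock and may run; otherwise 'c' stays
  // parked here until release() transfers the lock to it.
  bool acquire_or_enqueue(Continuation *c, bool at_front)
  {
    std::lock_guard<std::mutex> g(guard_); // very short critical section
    if (owner_ == nullptr) { owner_ = c; return true; }
    at_front ? waiters_.push_front(c) : waiters_.push_back(c);
    return false;
  }

  void release(); // called when the owning Continuation's handler exits

private:
  std::mutex guard_; // protects owner_ and waiters_ only (a spinlock would do)
  Continuation *owner_ = nullptr;
  std::deque<Continuation *> waiters_;
};

// Stand-in for an event thread running a ready Continuation: run the handler,
// then release its mutex, as the event loop would when the handler exits.
void run_now(Continuation *c)
{
  std::thread([c] {
    c->handler();
    if (c->mutex != nullptr) c->mutex->release();
  }).detach();
}

// Stand-in for scheduling: run now if the mutex is free (or absent),
// otherwise leave the Continuation parked on the mutex's queue.
void schedule(Continuation *c, bool at_front = false)
{
  if (c->mutex == nullptr || c->mutex->acquire_or_enqueue(c, at_front)) run_now(c);
}

void QueueMutex::release()
{
  Continuation *next = nullptr;
  {
    std::lock_guard<std::mutex> g(guard_);
    if (waiters_.empty()) { owner_ = nullptr; return; }
    next = waiters_.front();
    waiters_.pop_front();
    owner_ = next; // ownership transferred directly, never dropped
  }
  run_now(next);
}

// Direct lock for plain (non-Continuation) code: a helper Continuation plus
// two condition variables, per the proposal.  'dl' must outlive the helper.
struct DirectLock {
  std::mutex m;
  std::condition_variable locked_cv, released_cv;
  bool locked = false, released = false;
  Continuation helper;
};

void direct_lock(QueueMutex &qm, DirectLock &dl)
{
  dl.helper.mutex   = &qm;
  dl.helper.handler = [&dl] {
    std::unique_lock<std::mutex> lk(dl.m);
    dl.locked = true;
    dl.locked_cv.notify_one();                              // 1st condvar: lock granted
    dl.released_cv.wait(lk, [&dl] { return dl.released; }); // 2nd condvar: wait for unlock
  };
  schedule(&dl.helper, true); // front of the queue: we are blocking a whole event thread
  std::unique_lock<std::mutex> lk(dl.m);
  dl.locked_cv.wait(lk, [&dl] { return dl.locked; });
  // The caller now holds qm and may enter its critical section.
}

void direct_unlock(DirectLock &dl)
{
  { std::lock_guard<std::mutex> g(dl.m); dl.released = true; }
  dl.released_cv.notify_one(); // the helper's handler exits, which releases qm
}

Note that release() never simply drops the lock while waiters exist;
ownership moves directly to the front waiter before it is made ready to run,
which is what keeps the ready-to-run list limited to Continuations that can
actually run.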