Posted to commits@subversion.apache.org by st...@apache.org on 2014/04/10 20:08:46 UTC

svn commit: r1586391 - in /subversion/branches/thunder: BRANCH-README notes/thundering-herd.txt

Author: stefan2
Date: Thu Apr 10 18:08:45 2014
New Revision: 1586391

URL: http://svn.apache.org/r1586391
Log:
On the thunder branch: Add problem description and BRANCH-README.

* BRANCH-README
  (): New file.

* notes/thundering-herd.txt
  (): New file describing the problem and the attempted solution.

Added:
    subversion/branches/thunder/BRANCH-README
    subversion/branches/thunder/notes/thundering-herd.txt

Added: subversion/branches/thunder/BRANCH-README
URL: http://svn.apache.org/viewvc/subversion/branches/thunder/BRANCH-README?rev=1586391&view=auto
==============================================================================
--- subversion/branches/thunder/BRANCH-README (added)
+++ subversion/branches/thunder/BRANCH-README Thu Apr 10 18:08:45 2014
@@ -0,0 +1,5 @@
+This branch will enable the FS layer to mitigate some of the effects of
+the "Thundering Herd" problem.  See notes/thundering-herd.txt for details.
+
+Besides implementation and testing, we use this branch to come up
+with a usable interface.

Added: subversion/branches/thunder/notes/thundering-herd.txt
URL: http://svn.apache.org/viewvc/subversion/branches/thunder/notes/thundering-herd.txt?rev=1586391&view=auto
==============================================================================
--- subversion/branches/thunder/notes/thundering-herd.txt (added)
+++ subversion/branches/thunder/notes/thundering-herd.txt Thu Apr 10 18:08:45 2014
@@ -0,0 +1,50 @@
+The Problem
+-----------
+
+In certain situations, such as the announcement of a milestone being
+reached, the server gets hit by a large number of client requests for
+the same data.  Since updates or checkouts may take minutes, more of
+the same requests come in before the first ones complete.  The server
+then gets progressively slowed down.
+
+The best we may achieve is that the first request gets to read the
+data from disk and all others are being fed from the cache, basically
+saturating the network as necessary.
+
+However, there is a catch.  While the first request reads missing data
+from disk, all others get served quickly from cache and catch up until
+they miss the *same data* as the first request.  Disk access suddenly 
+acts like a synchronization barrier.  More importantly, reading and
+reconstructing data from disk is CPU expensive and blocks resources
+such as file handles.  Both limit scalability.
+
+
+The Solution
+------------
+
+We introduce a central registry that keeps track of FS layer operations.
+Whenever a cache lookup misses, we turn to that registry and tell it
+that we intend to read / reconstruct the missing data.  Once we are
+done, we inform the registry again.
+
+If the registry already contains an entry for the same node, it delays
+the request until the first one is completed and then tells the caller
+that it has not been the first.  The caller, in turn, may then retry
+the cache lookup and will usually find the required data.  If it misses
+again, we continue with reading / reconstructing the data ourselves.
+
+There are a few caveats that we need to address:
+
+* An initial reader may get aborted due to an error, get stuck, or be
+  otherwise delayed.  The registry uses a configurable timeout for that case.
+
+* Streamy processing can make the "end of access" notification
+  unreliable, i.e. it might not get sent in all cases.  Again, the
+  timeout will prevent the worst effects.
+
+* A reader may request access to the same data more than once at the
+  same time. The second request must not get delayed in that case.
+  Example: When a file gets committed, the deltification base as well
+  as the base for the incoming delta stream may be the same reps,
+  being read as streams at the same time.
+
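
To make the intended calling protocol concrete, here is a minimal
caller-side sketch in C.  It is an illustration only: the registry and
the thunder_* names are hypothetical, and the branch's actual API may
differ.

    /* Hypothetical caller-side protocol; thunder_* names are made up. */
    void *data = NULL;
    thunder_token_t *token;

    SVN_ERR(cache_lookup(&data, cache, key, pool));
    if (data == NULL)
      {
        /* Register our intent to read KEY.  If another reader already
           registered it, this blocks until that reader finishes or the
           configurable timeout expires. */
        SVN_ERR(thunder_coordinate(&token, registry, key));

        /* If we were not the first, the data is probably cached now. */
        if (token->another_got_to_read)
          SVN_ERR(cache_lookup(&data, cache, key, pool));

        if (data == NULL)
          {
            /* Still missing: read / reconstruct it ourselves. */
            SVN_ERR(reconstruct(&data, key, pool));
            SVN_ERR(cache_add(cache, key, data, pool));
          }

        /* Tell the registry that this access has completed. */
        SVN_ERR(thunder_completed(registry, token));
      }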



Re: svn commit: r1586391 - in /subversion/branches/thunder: BRANCH-README notes/thundering-herd.txt

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Mon, Apr 14, 2014 at 6:17 PM, Ivan Zhakov <iv...@visualsvn.com> wrote:

> On 12 April 2014 00:01, Stefan Fuhrmann <st...@wandisco.com> wrote:
> > It seems that the document in notes/ did not make it clear what
> > the actual problem is and how it applies to Subversion servers.
> > Let me try to illustrate it.
> >
> [...]
>
> >
> > Should these trends be confirmed for "real" HW, networks and ra_serf,
> > I'd love to see this code in 1.9 - after due review and feedback.
> >
>
> Hi Stefan,
>
> I got the idea from the original description, but thank you for explaining
> it one more time. Anyway, my concerns are still valid and I am -1 on
> implementing this at the FSFS layer. Especially if we're going to release
> Subversion 1.9 in 2014.
>

Having taken more measurements and done more analysis
on my server hardware last weekend, it seems that
mod_dav_svn requires a refined throttling scheme.

So, unless 1.9 gets massively delayed, I now plan to
merge the thunder branch in 1.10.


> I'm also skeptical about FSFS fulltext caching. I think that caching
> is worthwhile only for small metadata that doesn't consume a lot of
> memory but is expensive to build. But that's a different story.
>

[Only to give you a data point:]

Well, it can be disabled, and it can even be moved out
to memcached (and it is the only item type where a
network indirection is acceptable).

However, constructing fulltexts from data that is not
in the disk cache can be very expensive: we read 10-ish
concurrent data streams and then do quite a bit of
unzipping and checksumming on the data.

Apache already needs 8 CPU cores to saturate a
single 10Gb network connection (uncompressed, non-
encrypted source code) *with* 100% fulltext cache hits.
Adding more CPU load can be a problem.

-- Stefan^2.

Re: svn commit: r1586391 - in /subversion/branches/thunder: BRANCH-README notes/thundering-herd.txt

Posted by Ivan Zhakov <iv...@visualsvn.com>.
On 12 April 2014 00:01, Stefan Fuhrmann <st...@wandisco.com> wrote:
> It seems that the document in notes/ did not make it clear what
> the actual problem is and how it applies to Subversion servers.
> Let me try to illustrate it.
>
[...]

>
> Should these trends be confirmed for "real" HW, networks and ra_serf,
> I'd love to see this code in 1.9 - after due review and feedback.
>

Hi Stefan,

I got the idea from the original description, but thank you for explaining
it one more time. Anyway, my concerns are still valid and I am -1 on
implementing this at the FSFS layer. Especially if we're going to release
Subversion 1.9 in 2014.

I'm also skeptical about FSFS fulltext caching. I think that caching
is worthwhile only for small metadata that doesn't consume a lot of
memory but is expensive to build. But that's a different story.

-- 
Ivan Zhakov
CTO | VisualSVN | http://www.visualsvn.com

Re: svn commit: r1586391 - in /subversion/branches/thunder: BRANCH-README notes/thundering-herd.txt

Posted by Stefan Fuhrmann <st...@wandisco.com>.
On Sat, Apr 12, 2014 at 8:35 PM, Daniel Shahaf <d....@daniel.shahaf.name> wrote:

> Stefan Fuhrmann wrote on Fri, Apr 11, 2014 at 22:01:53 +0200:
> > Now, if there were 50 requests for the *same* reconstructed data:
> >
> >     file_t repo_files[50][20]
> >     for k in 0..49 parallel_do
> >         for i in 0..19 : repo_files[k][i].open("revs/$i")
> >         result[k] = ""
> >         for i in 0..19 : result[k].combine(repo_files[k][i].read())
> >
> > Caches don't help if they don't contain the data:
> >
> >     for k in 0..49 parallel_do
> >         result[k] = cache.lookup()   // fails for all at the same time
> >         if result[k].missing then
> >                         // moved sub-loops to sub-function for clarity
> >             result[k] = reconstruct(repo_files[k])
> >             cache.add(result[k])
> >
> > There are two major problems with that:
> >
> >     (1) We process the same data 50 times while once would suffice.
> >         SVN-internal caches did not help; however the OS may only
> >         have read the data once and then fed us from the disk cache.
>
> In other words: the problem is not eviction out of the disk cache but
> having to recreate the same fulltext many times.
>

Yes. It's about what has to happen before anything gets
*into* the cache(s), and about not wasting effort doing it. The
principle applies to other data as well (revprop packs,
manifest / index files etc.).


> >     (2) We keep 1000 files open.  On some systems, this may cause
> >         resource shortages.
>
> It sounds like (2) is an independent problem: we open files in advance
> of actually needing them.  (i.e., something the branch doesn't/won't
> change)
>

Yes and no. It is quite hard to change our file usage
pattern as we have to keep files open in the event of
a background 'svnadmin pack' and the branch won't
change that.

However, having only one request opening a bunch
of files instead of many requests doing the same will
avoid the resource issues in many scenarios.


> > The ideal solution / control flow would look like this:
> >
> >     for k = 0 do
> >         result[k] = reconstruct(repo_files[k])
> >         cache.add(result[k])
> >
> >     for k in 1..49 parallel_do
> >         result[k] = cache.lookup()
>
> Just to be clear, result[k] is some fulltext, right?  (of a file or of a
> directory)
>

Yes. Or a similar cacheable object.


> > Since we don't (can't?) coordinate requests on a global level, this
> > is what we do on the thunder branch:
>
> I'm not sure why you say we can't do global coordination.  We can use
> memcached for the fulltext cache, can't we?


What I was trying to say is that we can't / won't order the
requests in the sense that one becomes the dedicated
"scout thread" and everyone else would then be explicitly
fed from cache only.

Instead, any one of them may request a given piece of data
first, and all others wanting the same data will wait for a short
period. Then they check whether they can get the data
from cache, and read it from disk if they can't.

So, all requests are still very much independent of
one another, our caching does not need to be perfect,
and there are no timing constraints (cache data might
eventually get evicted etc.).

The cache is a singleton,
> so the cache's getter is a chokepoint that all threads will go through,
> so couldn't we put the entire thunder logic there?
>
> For example:
>
>     svn_error_t *
>     svn_cache__get_and_if_absent_populate(void **value,
>                                           /* svn_boolean_t *found, */
>                                           svn_cache__t *cache,
>                                           const void *key,
>                                           svn_error_t *(*factory)(void
> **value, void *baton, apr_pool_t *scratch_pool),
>                                           void *factory_baton,
>                                           apr_pool_t *result_pool,
>                                           apr_pool_t *scratch_pool);
>
> where FACTORY is called if KEY is not in cache and either (a) the
> current thread is the first in the thunder, or (b) the current thread is
> not first in the thunder, but it has slept for X seconds and the cache
> is unpopulated.
>
> Basically, keep in libsvn_fs only the determination of _which_ cache
> keys are going to be hot paths, and the implementation of the "thunder"
> considerations (let's maybe sleep to give another thread a chance) in
> svn_cache__*().
>

Well, that would be one option. I feel +/-0 on the
cache interface change. I'll think about it.

Right now, the branch uses two wrapper functions
around the cache getters that achieve the same
functionality (e.g. svn_fs_fs__thundered_cache_get).
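
For illustration, such a wrapper might look roughly like this.  Only
the name svn_fs_fs__thundered_cache_get comes from the branch; the
signature, the registry type and the thunder_* helpers are assumptions:

    /* Sketch of a "thundered" cache getter; details are assumptions. */
    static svn_error_t *
    thundered_cache_get(void **value,
                        svn_boolean_t *found,
                        svn_cache__t *cache,
                        const void *key,
                        thunder_registry_t *registry,
                        apr_pool_t *result_pool)
    {
      SVN_ERR(svn_cache__get(value, found, cache, key, result_pool));
      if (!*found)
        {
          thunder_token_t *token;

          /* Announce the read; this blocks briefly if someone else is
             already reading the same location. */
          SVN_ERR(thunder_coordinate(&token, registry, key));

          /* If we were not first, the first reader has probably put
             the data into the cache by now. */
          if (token->another_got_to_read)
            SVN_ERR(svn_cache__get(value, found, cache, key, result_pool));

          /* On a miss, the caller reconstructs the data; the second
             wrapper, around the cache setter, would then call
             thunder_completed(). */
        }
      return SVN_NO_ERROR;
    }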


> >     for k in 0..49 parallel_do
> >         result[k] = cache.lookup()
> >         if result[k].missing then
> >
> >             token = thunder.coordinate(data_location)
> >             if token.another_got_to_read then      // all but the first
> >                 result[k] = cache.lookup()
> >                 if result[k].ok : jump done        // >90% hit rate
> >
> >             result[k] = reconstruct(repo_files[k])
> >             cache.add(result[k])
> >
> >             thunder.completed(token)
> >     done
> >
> > So, there is no penalty on the hot path, i.e. when data can be found
> > in the respective cache.  The coordinating instance is also conceptually
> > simple (keep a list of all accesses in flight) and the delay for the
> > first thread is negligible.  Concurrent threads reading the same location
> > will be blocked until the initial thread has completed its access.  That
> > minimizes the code churn on the calling side.
> >
> > A timeout prevents rogue threads from blocking the whole system.  Also,
> > entries that timed out will be removed from the access list.  The rogue
> > thread would have to start another relevant data access (and be the first)
> > to block other threads again.
>
> Hmm.  The way I imagined it, the non-first threads just do:
>
>    if (I_am_first):
>      value = compute_value_of(key)
>      cache[key] = value
>      return value
>    else:
>      time.sleep(SLEEP_LENGTH)
>      try:
>        return cache[key]
>      except KeyError:
>        value = compute_value_of(key)
>        cache[key] = value
>        return value
>
> That way, no thread ever calls sleep() more than once, so threads will
> never be delayed by more than SLEEP_LENGTH.
>
> (This should address Ivan's second concern.)
>

Well, this is roughly equivalent to what I implemented initially.
The difference was that I would use a timed wait instead of
sleep() so that the waiting threads would resume as soon as possible.

Testing showed that all this, combined with the various mutex
ops around it, created a 20% overhead and some hard-to-interpret
CPU load stats. So, I changed it to simply using sleep() in
r1586964 and removed all the bits that weren't necessary anymore.
Now, things seem to scale nicely in many scenarios, with extra
clients increasing the processing time only slightly.
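
For reference, the difference amounts to something like this (a sketch
using APR primitives; access_in_flight() is a hypothetical helper and
the actual code in r1586964 may differ):

    /* Initial version: timed wait on a condition that the first reader
       signals upon completion -- resumes a.s.a.p. but adds mutex ops. */
    apr_thread_mutex_lock(mutex);
    while (access_in_flight(registry, key))
      apr_thread_cond_timedwait(cond, mutex, timeout);
    apr_thread_mutex_unlock(mutex);

    /* r1586964: non-first readers simply sleep once, then retry the
       cache lookup.  apr_sleep() takes microseconds. */
    apr_sleep(250 * 1000);   /* 250ms, the cap mentioned below */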


> > My plan is to test the code on my server setup at home to get a more
> > nuanced picture of what the performance and scalability impact is.
> > On my SSD MacBook, I get 70% less total CPU cost, 50% higher throughput
> > and smooth handling of 200 concurrent c/o (vs running out of file handles
> > at 100 clients) over ra_svn.
> >
>
> FWIW, what I wonder about is how well the change scales down (to small
> setups) and how the code is going to look with the feature added.
>

It's all implemented and ready for review now.

So far, I haven't seen any measurable impact on single-request
scenarios. It is clear that small, concurrent requests
may see an extra delay due to the sleep. But since we sleep
only if I/O (or at least OS interaction) is necessary, and only
for 250ms max, the added user-visible delay will be short
and, in most scenarios, infrequent.


> > Should these trends be confirmed for "real" HW, networks and ra_serf,
> > I'd love to see this code in 1.9 - after due review and feedback.
>
> Having it in 1.9 if it's ready in time would be great.  That said, we're
> releasing an alpha tomorrow, and I'm honestly not sure when it would be
> too late to add this.  (That is, I'm not sure when we enter feature
> freeze.  We don't allow any new features after an RC; I'm not sure how
> long before an RC a feature such as this one would have to be merged.)
>

Well, I think the code is relatively simple and robust.
The interaction with the existing code is also clear.

I've been testing the code in the scenarios that sparked
me into attempting this solution, and things look promising.
But it still needs broader testing, and I want to be able to
fully explain all the data that I see. In the end, I need to be
able to justify including it in the 1.9 release. Otherwise, it
would probably go into 1.10. I'll continue work on that once
I come back after Easter.

Until then, review and feedback would be very much appreciated.

-- Stefan^2.

Re: svn commit: r1586391 - in /subversion/branches/thunder: BRANCH-README notes/thundering-herd.txt

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
Stefan Fuhrmann wrote on Fri, Apr 11, 2014 at 22:01:53 +0200:
> Now, if there were 50 requests for the *same* reconstructed data:
> 
>     file_t repo_files[50][20]
>     for k in 0..49 parallel_do
>         for i in 0..19 : repo_files[k][i].open("revs/$i")
>         result[k] = ""
>         for i in 0..19 : result[k].combine(repo_files[k][i].read())
> 
> Caches don't help if they don't contain the data:
> 
>     for k in 0..49 parallel_do
>         result[k] = cache.lookup()   // fails for all at the same time
>         if result[k].missing then
>                         // moved sub-loops to sub-function for clarity
>             result[k] = reconstruct(repo_files[k])
>             cache.add(result[k])
> 
> There are two major problems with that:
> 
>     (1) We process the same data 50 times while once would suffice.
>         SVN-internal caches did not help; however the OS may only
>         have read the data once and then fed us from the disk cache.

In other words: the problem is not eviction out of the disk cache but
having to recreate the same fulltext many times.

>     (2) We keep 1000 files open.  On some systems, this may cause
>         resource shortages.

It sounds like (2) is an independent problem: we open files in advance
of actually needing them.  (i.e., something the branch doesn't/won't change)

> The ideal solution / control flow would look like this:
> 
>     for k = 0 do
>         result[k] = reconstruct(repo_files[k])
>         cache.add(result[k])
> 
>     for k in 1..49 parallel_do
>         result[k] = cache.lookup()

Just to be clear, result[k] is some fulltext, right?  (of a file or of a
directory)

> Since we don't (can't?) coordinate requests on a global level, this
> is what we do on the thunder branch:

I'm not sure why you say we can't do global coordination.  We can use
memcached for the fulltext cache, can't we?  The cache is a singleton,
so the cache's getter is a chokepoint that all threads will go through,
so couldn't we put the entire thunder logic there?

For example:

    svn_error_t *
    svn_cache__get_and_if_absent_populate(void **value,
                                          /* svn_boolean_t *found, */
                                          svn_cache__t *cache,
                                          const void *key,
                                          svn_error_t *(*factory)(void **value, void *baton, apr_pool_t *scratch_pool),
                                          void *factory_baton,
                                          apr_pool_t *result_pool,
                                          apr_pool_t *scratch_pool);

where FACTORY is called if KEY is not in cache and either (a) the
current thread is the first in the thunder, or (b) the current thread is
not first in the thunder, but it has slept for X seconds and the cache
is unpopulated.

Basically, keep in libsvn_fs only the determination of _which_ cache
keys are going to be hot paths, and the implementation of the "thunder"
considerations (let's maybe sleep to give another thread a chance) in
svn_cache__*().
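
A sketch of how (a) and (b) could combine inside such a getter;
first_in_thunder() and SLEEP_LENGTH are placeholders, not existing APIs:

    /* Illustrative body for the proposed function. */
    svn_boolean_t found;

    SVN_ERR(svn_cache__get(value, &found, cache, key, result_pool));
    if (found)
      return SVN_NO_ERROR;

    if (!first_in_thunder(cache, key))     /* case (b): not first */
      {
        apr_sleep(SLEEP_LENGTH);           /* let the first thread finish */
        SVN_ERR(svn_cache__get(value, &found, cache, key, result_pool));
        if (found)
          return SVN_NO_ERROR;
      }

    /* Case (a), or case (b) after an unlucky sleep: build the value. */
    SVN_ERR(factory(value, factory_baton, scratch_pool));
    SVN_ERR(svn_cache__set(cache, key, *value, scratch_pool));
    return SVN_NO_ERROR;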

>     for k in 0..49 parallel_do
>         result[k] = cache.lookup()
>         if result[k].missing then
> 
>             token = thunder.coordinate(data_location)
>             if token.another_got_to_read then      // all but the first
>                 result[k] = cache.lookup()
>                 if result[k].ok : jump done        // >90% hit rate
> 
>             result[k] = reconstruct(repo_files[k])
>             cache.add(result[k])
> 
>             thunder.completed(token)
>     done
> 
> So, there is no penalty on the hot path, i.e. when data can be found
> in the respective cache.  The coordinating instance is also conceptually
> simple (keep a list of all accesses in flight) and the delay for the
> first thread is negligible.  Concurrent threads reading the same location
> will be blocked until the initial thread has completed its access.  That
> minimizes the code churn on the calling side.
> 
> A timeout prevents rogue threads from blocking the whole system.  Also,
> entries that timed out will be removed from the access list.  The rogue
> thread would have to start another relevant data access (and be the first)
> to block other threads again.

Hmm.  The way I imagined it, the non-first threads just do:

   if (I_am_first):
     value = compute_value_of(key)
     cache[key] = value
     return value
   else:
     time.sleep(SLEEP_LENGTH)
     try:
       return cache[key]
     except KeyError:
       value = compute_value_of(key)
       cache[key] = value
       return value

That way, no thread ever calls sleep() more than once, so threads will
never be delayed by more than SLEEP_LENGTH.

(This should address Ivan's second concern.)

> My plan is to test the code on my server setup at home to get a more
> nuanced picture of what the performance and scalability impact is.
> On my SSD MacBook, I get 70% less total CPU cost, 50% higher throughput
> and smooth handling of 200 concurrent c/o (vs running out of file handles
> at 100 clients) over ra_svn.
> 

FWIW, what I wonder about is how well the change scales down (to small
setups) and how the code is going to look with the feature added.

> Should these trends be confirmed for "real" HW, networks and ra_serf,
> I'd love to see this code in 1.9 - after due review and feedback.

Having it in 1.9 if it's ready in time would be great.  That said, we're
releasing an alpha tomorrow, and I'm honestly not sure when it would be
too late to add this.  (That is, I'm not sure when we enter feature
freeze.  We don't allow any new features after an RC; I'm not sure how
long before an RC a feature such as this one would have to be merged.)

Thanks for the explanation,

Daniel

Re: svn commit: r1586391 - in /subversion/branches/thunder: BRANCH-README notes/thundering-herd.txt

Posted by Stefan Fuhrmann <st...@wandisco.com>.
It seems that the document in notes/ did not make it clear what
the actual problem is and how it applies to Subversion servers.
Let me try to illustrate it.

Assume we want to reconstruct a single file from the repository
as part of a single request; this is what we effectively do
(omitting minor details):

    file_t repo_files[20]
    for i in 0..19 : repo_files[i].open("revs/$i")
    result = ""
    for i in 0..19 : result.combine(repo_files[i].read())

Now, if there were 50 requests for the *same* reconstructed data:

    file_t repo_files[50][20]
    for k in 0..49 parallel_do
        for i in 0..19 : repo_files[k][i].open("revs/$i")
        result[k] = ""
        for i in 0..19 : result[k].combine(repo_files[k][i].read())

Caches don't help if they don't contain the data:

    for k in 0..49 parallel_do
        result[k] = cache.lookup()   // fails for all at the same time
        if result[k].missing then
                        // moved sub-loops to sub-function for clarity
            result[k] = reconstruct(repo_files[k])
            cache.add(result[k])

There are two major problems with that:

    (1) We process the same data 50 times while once would suffice.
        SVN-internal caches did not help; however the OS may only
        have read the data once and then fed us from the disk cache.
    (2) We keep 1000 files open.  On some systems, this may cause
        resource shortages.

How likely is the above scenario in SVN?  An operation like
checkout may take many minutes to complete.  The first client to
do the c/o will read data from disk and populate the server caches.
Any other client coming in later will be much faster since it gets
fed from cache.

If new c/o requests keep coming in before the first one completes,
those extra requests have a good chance of "catching up" with the
first one.  In cases like ra_svn that have a fully deterministic
reporting order, all requests may gang up into the "50 requests
scenario" above.  And they will do so over and over for many files
to come.

With ra_serf, things are slightly more subtle, if the clients
randomize their requests (not sure they do).  For them, it is metadata
(revprop packs, indexes) and data layout (temporal locality being
correlated to spatial locality) that will see the issue - albeit
in a more distributed fashion (e.g. 10 locations with 5 readers each
instead of 1 with 50).

The ideal solution / control flow would look like this:

    for k = 0 do
        result[k] = reconstruct(repo_files[k])
        cache.add(result[k])

    for k in 1..49 parallel_do
        result[k] = cache.lookup()

Since we don't (can't?) coordinate requests on a global level, this
is what we do on the thunder branch:

    for k in 0..49 parallel_do
        result[k] = cache.lookup()
        if result[k].missing then

            token = thunder.coordinate(data_location)
            if token.another_got_to_read then      // all but the first
                result[k] = cache.lookup()
                if result[k].ok : jump done        // >90% hit rate

            result[k] = reconstruct(repo_files[k])
            cache.add(result[k])

            thunder.completed(token)
    done

So, there is no penalty on the hot path, i.e. when data can be found
in the respective cache.  The coordinating instance is also conceptually
simple (keep a list of all accesses in flight) and the delay for the
first thread is negligible.  Concurrent threads reading the same location
will be blocked until the initial thread has completed its access.  That
minimizes the code churn on the calling side.

A timeout prevents rogue threads from blocking the whole system.  Also,
entries that timed out will be removed from the access list.  The rogue
thread would have to start another relevant data access (and be the first)
to block other threads again.
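
In data structure terms, the coordinating instance can be as simple as
the following sketch (illustrative; the branch's actual types differ):

    /* Sketch of the in-flight access registry. */
    typedef struct access_t
    {
      const char *data_location;    /* what is being read / reconstructed */
      apr_time_t started;           /* enables the timeout on rogue readers */
    } access_t;

    typedef struct registry_t
    {
      apr_hash_t *in_flight;        /* data_location -> access_t */
      svn_mutex__t *mutex;          /* serializes registry updates */
      apr_interval_time_t timeout;  /* configurable */
    } registry_t;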

My plan is to test the code on my server setup at home to get a more
nuanced picture of what the performance and scalability impact is.
On my SSD MacBook, I get 70% less total CPU cost, 50% higher throughput
and smooth handling of 200 concurrent c/o (vs running out of file handles
at 100 clients) over ra_svn.

Should these trends be confirmed for "real" HW, networks and ra_serf,
I'd love to see this code in 1.9 - after due review and feedback.


-- Stefan^2.

Re: svn commit: r1586391 - in /subversion/branches/thunder: BRANCH-README notes/thundering-herd.txt

Posted by Branko Čibej <br...@wandisco.com>.
On 11.04.2014 02:24, Daniel Shahaf wrote:
> stefan2@apache.org wrote on Thu, Apr 10, 2014 at 18:08:46 -0000:
>> +++ subversion/branches/thunder/notes/thundering-herd.txt Thu Apr 10 18:08:45 2014
>> @@ -0,0 +1,50 @@
>> +The Problem
>> +-----------
>> +
>> +In certain situations, such as the announcement of a milestone being
>> +reached, the server gets hit by a large number of client requests for
>> +the same data.  Since updates or checkouts may take minutes, more of
>> +the same requests come in before the first ones complete.  The server
>> +then gets progressively slowed down.
> Does this have to be solved in svn?
>
> The solution sounds like you are reinventing the OS' disk cache.
> Wouldn't it be better to make a syscall to hint the cache that "<these>
> files are going to be a hot path for the next few minutes", so the OS
> can then consider the size of the files v. the size of the CPU caches
> and other processes' needs and make its own decisions?
>
> I'm not against having this branch, I just don't immediately see why
> this is the best solution to the described problem.

I'd normally agree ... I'm pretty sure that in the described scenario,
the disk cache is already as hot as it'll ever get. The fact is that
the FS layer caches data that are derived from disk contents in a
rather expensive way, and no amount of disk caching can amortize that
expense.

However: I would argue that, long term, effort is better invested into
changing the way we store data on disk to lower or, dare I say,
eliminate the expense of deriving it. Currently the cached data falls
into two distinct groups:

  * metadata (directories and nodes)
  * content (fulltext, and I think deltas?)

Retrieving the fulltext of a file is in general going to be expensive,
because of deltification; but the thundering-herd effect isn't all that
important for file contents, compared to metadata, since the latter gets
the lion's share of accesses.

Currently, the cost of retrieving metadata is directly dependent on the
cost of retrieving content, since we naïvely store directories the same
way as file contents. I'd argue that effort is better spent on
implementing the new metadata index design which makes much better use
of disk (or rather, database page) caching.


FWIW, I agree with Ivan: the proposed block/retry approach leaves me
less than enthusiastic ... I don't think it's even provable that it
cannot cause deadlocks, because the blocking order depends on access
patterns driven by clients that we have no control over. OK, nothing
will actually deadlock, since the locks are designed to expire; but I can
easily imagine scenarios where the blocking actually slows things down
overall.

Limiting the number of allowed parallel requests at the HTTPd and
svnserve level is a much better approach, IMO, and can be tweaked by the
admin depending on server capabilities. I don't think anyone's actually
ever researched how many parallel requests the repository can actually
handle in these situations. I suspect looking into that would be time
better spent.
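
For example (my illustration, not something discussed in the thread),
with httpd's worker MPM that limit is plain configuration:

    # httpd.conf sketch -- illustrative numbers only
    <IfModule mpm_worker_module>
        # Cap concurrent requests server-wide (MaxClients in httpd 2.2).
        MaxRequestWorkers  64
    </IfModule>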

-- Brane

-- 
Branko Čibej | Director of Subversion
WANdisco // Non-Stop Data
e. brane@wandisco.com

Re: svn commit: r1586391 - in /subversion/branches/thunder: BRANCH-README notes/thundering-herd.txt

Posted by Daniel Shahaf <d....@daniel.shahaf.name>.
stefan2@apache.org wrote on Thu, Apr 10, 2014 at 18:08:46 -0000:
> +++ subversion/branches/thunder/notes/thundering-herd.txt Thu Apr 10 18:08:45 2014
> @@ -0,0 +1,50 @@
> +The Problem
> +-----------
> +
> +In certain situations, such as the announcement of a milestone being
> +reached, the server gets hit by a large number of client requests for
> +the same data.  Since updates or checkouts may take minutes, more of
> +the same requests come in before the first ones complete.  The server
> +then gets progressively slowed down.

Does this have to be solved in svn?

The solution sounds like you are reinventing the OS's disk cache.
Wouldn't it be better to make a syscall to hint the cache that "<these>
files are going to be a hot path for the next few minutes", so the OS
can then consider the size of the files vs. the size of the CPU caches
and other processes' needs and make its own decisions?

I'm not against having this branch, I just don't immediately see why
this is the best solution to the described problem.
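
One concrete form such a hint can take on POSIX systems is
posix_fadvise(); this is an illustration, the message does not name a
specific call:

    #include <fcntl.h>

    /* Tell the OS we are about to read this whole file ("hot path"). */
    static int
    hint_will_need(int fd)
    {
      /* offset 0, length 0 covers the entire file */
      return posix_fadvise(fd, 0, 0, POSIX_FADV_WILLNEED);
    }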

Daniel

P.S. "Bypass the kernel's memory management strategy" even ties to the
recent OpenSSL bug: http://article.gmane.org/gmane.os.openbsd.misc/211963
(tl;dr: OpenSSL used its own malloc(), but using the OS malloc() would
have prevented CVE-2014-0160 from resulting in memory disclosure)

> +The best we may achieve is that the first request gets to read the
> +data from disk and all others are being fed from the cache, basically
> +saturating the network as necessary.
> +
> +However, there is a catch.  While the first request reads missing data
> +from disk, all others get served quickly from cache and catch up until
> +they miss the *same data* as the first request.  Disk access suddenly 
> +acts like a synchronization barrier.  More importantly, reading and
> +reconstructing data from disk is CPU expensive and blocks resources
> +such as file handles.  Both limit scalability.
> +
> +
> +The Solution
> +------------
> +
> +We introduce a central registry that keeps track of FS layer operations.
> +Whenever a cache lookup misses, we turn to that registry and tell it
> +that we intend to read / reconstruct the missing data.  Once we are
> +done, we inform the registry again.
> +
> +If the registry already contains an entry for the same node, it delays
> +the request until the first one is completed and then tells the caller
> +that it has not been the first.  The caller, in turn, may then retry
> +the cache lookup and will usually find the required data.  If it misses
> +again, we continue with reading / reconstructing the data ourselves.
> +
> +There are a few caveats that we need to address:
> +
> +* An initial reader may get aborted due to an error, get stuck, or be
> +  otherwise delayed.  The registry uses a configurable timeout for that case.
> +
> +* Streamy processing can make the "end of access" notification
> +  unreliable, i.e. it might not get sent in all cases.  Again, the
> +  timeout will prevent the worst effects.
> +
> +* A reader may request access to the same data more than once at the
> +  same time. The second request must not get delayed in that case.
> +  Example: When a file gets committed, the deltification base as well
> +  as the base for the incoming delta stream may be the same reps,
> +  being read as streams at the same time.
> +
> 
> 

Re: svn commit: r1586391 - in /subversion/branches/thunder: BRANCH-README notes/thundering-herd.txt

Posted by Ivan Zhakov <iv...@visualsvn.com>.
On 10 April 2014 22:08,  <st...@apache.org> wrote:
> Author: stefan2
> Date: Thu Apr 10 18:08:45 2014
> New Revision: 1586391
>
> URL: http://svn.apache.org/r1586391
> Log:
> On the thunder branch: Add problem description and BRANCH-README.
>
> * BRANCH-README
>   (): New file.
>
> * notes/thundering-herd.txt
>   (): New file describing the problem and the attempted solution.
>
> Added:
>     subversion/branches/thunder/BRANCH-README
>     subversion/branches/thunder/notes/thundering-herd.txt
Hi Stefan,

I hope you are just experimenting and not going to merge this to trunk.
Otherwise I'm -1 on this approach:
1. The problem should be solved at the topmost layer (MPM) or the lowest
(OS disk/CPU scheduling).
2. Delaying access at the cache layer is a guaranteed way to get sporadic
deadlocks and lockups.  And these lockups will be extremely hard to
reproduce since they will happen only in heavily loaded scenarios.

-- 
Ivan Zhakov
CTO | VisualSVN | http://www.visualsvn.com