Posted to dev@httpd.apache.org by Jim Riggs <ap...@riggs.me> on 2014/04/08 22:11:30 UTC

mod_cache thundering herd bug

https://issues.apache.org/bugzilla/show_bug.cgi?id=50317

While we are at ApacheCon, I would love to address this nasty bug with someone familiar with 2.2's mod_cache. Our sites were brought down a few times last year before we finally tracked it down to being this particular bug. I am using a crude backport of the 2.3 patch (r1023398) in 2.2. It works, but I don't know if it is correct.

Can someone look at this one with me? We really need to get this fixed in 2.2, because there is NO thundering herd protection at all as things stand right now.

- Jim


Re: mod_cache thundering herd bug

Posted by Eric Covener <co...@gmail.com>.
> Covener - Are you talking about my comments in #16 on the ticket? (https://issues.apache.org/bugzilla/show_bug.cgi?id=50317#c16)
>
> If so, do either you or Graham have thoughts on the Age header getting returned with stale content? In my testing, when stale content is returned, no Age header is set, which appears to be a violation of HTTP/1.1.
>

Yes. I think it's not that it's unset, but that the calculation
somehow uses the revalidation-in-progress check time as the basis.

-- 
Eric Covener
covener@gmail.com

Re: mod_cache thundering herd bug

Posted by Jim Riggs <ap...@riggs.me>.
On 21 Apr 2014, at 06:38, Graham Leggett <mi...@sharp.fm> wrote:

> On 19 Apr 2014, at 10:26 PM, Eric Covener <co...@gmail.com> wrote:
> 
>> Graham -- a related subject came up either in Denver or in the bug.
>> It seems that when we serve a stale file while the cache is locked,
>> the Age headers are small instead of large. I got totally lost trying
>> to track down the issue; maybe it makes sense to you?  It's almost as
>> if the time of the revalidation is somehow updated early and the
>> delta in the stale cache hits is based off of that.
> 
> All thundering herd protection does is this: after letting the first conditional request through, it serves stale data (RFC willing) until that conditional request comes back or a specific maximum time is reached, whichever comes first.
> 
> The most valuable piece of information in this process is the "reason" variable, which describes why something wasn't eligible for caching. In httpd v2.4 the X-Cache-Detail header will give this to you; in httpd v2.2 you'll need to log at DEBUG level to get it:
> 
>        ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, r,
>                "cache: %s not cached. Reason: %s", r->unparsed_uri,
>                reason);
> 
> The questions to answer are:
> 
> - Is there stale content to serve? No stale content, no thundering herd protection.
> - If stale content is being deleted, identify why that is. This is likely to be unrelated to thundering herd, but rather in other parts of mod_cache.



Covener - Are you talking about my comments in #16 on the ticket? (https://issues.apache.org/bugzilla/show_bug.cgi?id=50317#c16)

If so, do either you or Graham have thoughts on the Age header getting returned with stale content? In my testing, when stale content is returned, no Age header is set, which appears to be a violation of HTTP/1.1.


Re: mod_cache thundering herd bug

Posted by Graham Leggett <mi...@sharp.fm>.
On 19 Apr 2014, at 10:26 PM, Eric Covener <co...@gmail.com> wrote:

> Graham -- a related subject came up either in Denver or in the bug.
> It seems that when we serve a stale file while the cache is locked,
> the Age headers are small instead of large. I got totally lost trying
> to track down the issue; maybe it makes sense to you?  It's almost as
> if the time of the revalidation is somehow updated early and the
> delta in the stale cache hits is based off of that.

All thundering herd protection does is this: after letting the first conditional request through, it serves stale data (RFC willing) until that conditional request comes back or a specific maximum time is reached, whichever comes first.
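
As a rough illustration of that behaviour, here is a minimal sketch of the per-request decision. It is not mod_cache's code: herd_state, herd_decide(), herd_refreshed() and lock_max_age are hypothetical names made up for this example, and real locking, storage and the RFC staleness rules are reduced to plain flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not mod_cache's API. */
typedef enum {
    HERD_SERVE_FRESH,   /* entry is fresh: ordinary cache hit */
    HERD_REVALIDATE,    /* this request takes the lock and goes to the backend */
    HERD_SERVE_STALE,   /* another request is revalidating: serve the stale copy */
    HERD_PASS_THROUGH   /* nothing servable is cached: go to the backend unprotected */
} herd_action;

typedef struct {
    bool    has_entry;      /* a cached copy exists, fresh or stale */
    bool    is_fresh;       /* the cached copy is still fresh */
    bool    stale_allowed;  /* assumption: the RFC rules permit serving it stale */
    bool    lock_held;      /* a revalidation is already in flight */
    int64_t lock_time;      /* when the lock was taken, in seconds */
} herd_state;

/* Decide what to do with one incoming request at time `now`. */
herd_action herd_decide(herd_state *s, int64_t now, int64_t lock_max_age)
{
    assert(s != NULL);

    if (s->has_entry && s->is_fresh)
        return HERD_SERVE_FRESH;

    /* A lock older than the maximum time no longer holds anyone back. */
    if (s->lock_held && now - s->lock_time >= lock_max_age)
        s->lock_held = false;

    /* The first request to find no lock is let through to the backend. */
    if (!s->lock_held) {
        s->lock_held = true;
        s->lock_time = now;
        return HERD_REVALIDATE;
    }

    /* Everyone else gets stale data, provided there is some to serve. */
    if (s->has_entry && s->stale_allowed)
        return HERD_SERVE_STALE;

    return HERD_PASS_THROUGH;
}

/* Called when the request that was let through comes back. */
void herd_refreshed(herd_state *s)
{
    assert(s != NULL);
    s->has_entry = true;
    s->is_fresh = true;
    s->lock_held = false;
}
```

With a stale entry and lock_max_age of 5, a request at time 100 is let through, a request at time 101 is served stale, and a request at time 105 is let through again because the lock has expired. With no cached entry at all, a request arriving while the lock is held simply passes through to the backend, which is the "no stale content, no protection" case.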

The most valuable piece of information in this process is the "reason" variable, which describes why something wasn't eligible for caching. In httpd v2.4 the X-Cache-Detail header will give this to you; in httpd v2.2 you'll need to log at DEBUG level to get it:

        ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, r,
                "cache: %s not cached. Reason: %s", r->unparsed_uri,
                reason);

The questions to answer are:

- Is there stale content to serve? No stale content, no thundering herd protection.
- If stale content is being deleted, identify why that is. This is likely to be unrelated to thundering herd, but rather in other parts of mod_cache.

Regards,
Graham


Re: mod_cache thundering herd bug

Posted by Eric Covener <co...@gmail.com>.
On Tue, Apr 8, 2014 at 4:11 PM, Jim Riggs <ap...@riggs.me> wrote:
> https://issues.apache.org/bugzilla/show_bug.cgi?id=50317
>
> While we are at ApacheCon, I would love to address this nasty bug with someone familiar with 2.2's mod_cache. Our sites were brought down a few times last year before we finally tracked it down to being this particular bug. I am using a crude backport of the 2.3 patch (r1023398) in 2.2. It works, but I don't know if it is correct.
>
> Can someone look at this one with me? We really need to get this fixed in 2.2, because there is NO thundering herd protection at all as things stand right now.
>


Graham -- a related subject came up either in Denver or in the bug.
It seems that when we serve a stale file while the cache is locked,
the Age headers are small instead of large. I got totally lost trying
to track down the issue; maybe it makes sense to you?  It's almost as
if the time of the revalidation is somehow updated early and the
delta in the stale cache hits is based off of that.
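
For reference, here is a minimal sketch of the age arithmetic from RFC 2616 section 13.2.3, which is the calculation the paragraph above is speculating about. The struct and function names are hypothetical and this is not mod_cache's code; it only shows how the stored timestamps feed into the Age value, so that the "updated early" theory can be checked with numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container; field names follow RFC 2616 section 13.2.3.
 * All times are whole seconds on the cache's clock. */
typedef struct {
    int64_t date_value;     /* Date header of the stored response */
    int64_t age_value;      /* Age header received with it, 0 if none */
    int64_t request_time;   /* when the cache sent the request */
    int64_t response_time;  /* when the cache received the response */
} cache_times;

static int64_t max64(int64_t a, int64_t b)
{
    return a > b ? a : b;
}

/* current_age as defined by RFC 2616 section 13.2.3. */
int64_t current_age(const cache_times *t, int64_t now)
{
    /* A response cannot arrive before its request was sent. */
    assert(t->response_time >= t->request_time);

    int64_t apparent_age = max64(0, t->response_time - t->date_value);
    int64_t corrected_received_age = max64(apparent_age, t->age_value);
    int64_t response_delay = t->response_time - t->request_time;
    int64_t corrected_initial_age = corrected_received_age + response_delay;
    int64_t resident_time = now - t->response_time;

    return corrected_initial_age + resident_time;
}
```

Take an entry stored at time 1000 and read at time 1600: the age is 600. If only request_time and response_time are moved forward to 1590 while date_value stays at 1000, the result is still 600, because apparent_age grows by exactly what resident_time loses. The age drops to 10 only when date_value is moved to 1590 as well. Under this formula, then, a small Age on a stale hit would suggest that the stored Date is being replaced too, not just the response time. That is an inference from the RFC arithmetic, not a confirmed diagnosis of the bug.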

-- 
Eric Covener
covener@gmail.com

Re: mod_cache thundering herd bug

Posted by Jim Riggs <ap...@riggs.me>.
On 9 Apr 2014, at 14:46, Eric Covener <co...@gmail.com> wrote:

> r1023398 for 2.2:
> 
>  http://people.apache.org/~covener/patches/httpd-2.2.x-thunder.diff
> 
> The remove_url() call prevents other threads from serving a stale cached
> file during refresh of a slow response, but it's unnecessary to have a
> separate path because the refresh has to deal with 200s already.  When
> remove_url() was added, there was no thundering herd lock / no
> ability to serve stale content while one guy was reloading.


covener, mrumph, and I looked at this today at ApacheCon. I updated the bug with some comments and attached this patch.

https://issues.apache.org/bugzilla/show_bug.cgi?id=50317


Re: mod_cache thundering herd bug

Posted by Eric Covener <co...@gmail.com>.
r1023398 for 2.2:

  http://people.apache.org/~covener/patches/httpd-2.2.x-thunder.diff

The remove_url() call prevents other threads from serving a stale cached
file during refresh of a slow response, but it's unnecessary to have a
separate path because the refresh has to deal with 200s already.  When
remove_url() was added, there was no thundering herd lock / no
ability to serve stale content while one guy was reloading.

On Tue, Apr 8, 2014 at 2:11 PM, Jim Riggs <ap...@riggs.me> wrote:
> https://issues.apache.org/bugzilla/show_bug.cgi?id=50317
>
> While we are at ApacheCon, I would love to address this nasty bug with someone familiar with 2.2's mod_cache. Our sites were brought down a few times last year before we finally tracked it down to being this particular bug. I am using a crude backport of the 2.3 patch (r1023398) in 2.2. It works, but I don't know if it is correct.
>
> Can someone look at this one with me? We really need to get this fixed in 2.2, because there is NO thundering herd protection at all as things stand right now.
>
> - Jim
>



-- 
Eric Covener
covener@gmail.com