You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@trafficserver.apache.org by Brian Neradt <br...@gmail.com> on 2021/03/24 21:13:24 UTC

Design Review: How to handle negative_revalidating_lifetime

dev@trafficserver.apache.org:

Context
This design review concerns proxy.config.http.negative_revalidating_lifetime
<https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime>
and
is the result of investigating the feature for the following issue:

https://github.com/apache/trafficserver/issues/7425

Feature Overview
The ATS negative revalidation feature implements the option to serve stale
cached responses in the circumstance that the server is not reachable for
revalidation. This can happen due to connectivity issues against the
origin or the origin's service being down resulting in 5xx responses.

As a clarification, do not conflate negative revalidation with the negative
caching feature. While they sound alike, the two features are orthogonal to
each other. The ATS negative caching feature allows for caching of negative
HTTP responses (such as 404 "Not Found"). By contrast, as summarized above,
negative revalidation allows for serving some cached response that is stale
when the server is not reachable for revalidation.

negative_revalidating_lifetime issues
That's the negative revalidation feature: serve stale content when the
server is not accessible. This generally works as expected. The problem
lies in determining whether the stale resource is fresh enough via the
negative_revalidating_lifetime configuration for this feature. The
intention of this lifetime configuration was to provide a way for the user
to specify how old a stale resource can be when negative revalidation takes
place before it will no longer be served from the cache. Looking at the
documentation and the code for it, the *expected behavior* seemed to have
been the following:


   1. A client requests an item. The server responds with, say, a 200 OK
   and the ATS proxy caches the response.
   2. At some point in the future, a client requests this item again.
   3. ATS inspects the item and determines that it is stale.
   4. ATS attempts to validate its cached resource against the origin.
   5. Either a connection cannot be made to the origin or it receives a 5xx
   in response to its validation request.
   6. If negative revalidation is enabled, ATS compares the age of the
   stale resource against negative_revalidating_lifetime. It will then do one
   of two things:
      1. If the age of the cached resource is younger than
      negative_revalidating_lifetime, consider it "fresh enough" and
reply to the
      origin with the cached response.
      2. If the age is older than negative_revalidating_lifetime, reply
      with a 5xx response.

In reality step 6 does not happen. The stale resource is never re-evaluated
as fresh or still stale against negative_revalidating_lifetime. Rather, the
determining factor of whether the resource is served is
proxy.config.http.cache.max_stale_age
<https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-cache-max-stale-age>.
That is, if its age is less than max_stale_age, it is used; otherwise it is
not. In addition, an incorrectly calculated Expires header field is added
which is set to negative_revalidating_lifetime seconds from the current
time rather than negative_revalidating_lifetime seconds from the resources
calculated generation time.

These negative_revalidating_lifetime issues may very well be day zero bugs.
Looking at git, it goes back to at least 2009 in a commit
<https://github.com/apache/trafficserver/commit/a1651345d11f228dfb392cc79169760e80d25f06>
titled "Initial commit" (presumably when we transitioned source code
repositories). It has not worked for at least a very long time and people
have been relying, knowingly or not, upon max_stale_age to limit how stale
resources can be before they are used for negative revalidation.

This design review evaluates the options for how to handle
negative_revalidating_lifetime.

Options
Option 1: Document the Current Behavior
The first option is to leave the feature as is and document how
negative_revalidating_lifetime currently behaves.

Advantages:

   - This requires the least amount of work as it only involves a
   documentation change.
   - This will effectively avoid user confusion about
   negative_revalidating_lifetime so that no further issues will be filed and
   time wasted when it does not behave as expected.

Disadvantages

   - negative_revalidating_lifetime, as it currently behaves, would be hard
   to explain without embarrassing the project. We would have to say something
   like:

The value of negative_revalidating_lifetime is used to calculate an Expires
> header with a field value of negative_revalidating_lifetime seconds in the
> future from the time the cached resource is served. This header is added in
> the cached response. No other behavior is influenced by this configuration.
> The age limit enforced for such stale resources is configured
> via proxy.config.http.cache.max_stale_age.


This likely raises more questions than it answers. In reality, it would
probably be most helpful to steer users away from this feature and toward
max_stale_age.


Option 2: Fix the Issues
Alternatively, we can fix the issues surrounding
negative_revalidating_lifetime. Looking at the code, it seems that the
Expires header, which is currently incorrectly calculated, really isn't
needed. Rather we can make the negative_revalidating_lifetime conditionally
a part of the HttpTransact::is_stale_cache_response_returnable()
<https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L5943>
function.

In addition to max_stale_age, which is currently taken into account there,
we can also compare against negative_revalidating_lifetime to see whether
the resource can be served (but only when negative revalidation is being
used...the function is used in other contexts).

Advantages:

   - This addresses both the known issues with
   negative_revalidating_lifetime in that the Expires header is no longer
   added and the negative_revalidating_lifetime is effectively used in
   determining whether the item is fresh enough for the response.
   - The fix doesn't look all that hard to implement.

Disadvantages:

   - Implementing the feature adds various complexities to ATS. It
   complicates the documentation, it complicates the code, and will require
   tests to verify it works correctly. None of this complexity is large if the
   feature is worth it, and I'm happy to do it. But I have my doubts that it
   is worth it. Keep in mind that negative_revalidating_lifetime has never
   been a working feature in ATS and this has caused problems for no one. If
   its absence had caused problems, we would know about it because people
   would complain. In fact, the only reported issue with the feature is the
   confusion it caused by existing in the document in the first place. Also,
   keep in mind that this feature would only be useful, if it is at all, to
   *extend* the lifetime of these cached resources past max_stale_age. Both
   of those values would be used and we would want to use the max of each.
   Currently max_stale_age defaults to 7 days. How many users will want
   max_stale_age to be one value for general max-age computations, but use a
   different, larger, value for negative revalidation? I'm guessing not many.
   (Alternatively, I suppose we could use only negative_revalidating_lifetime
   instead of using both it and max_stale_age in these situations, but that
   would potentially be a breaking change for current configurations in which
   negative_revalidation would now only happen for the default 30 minutes of
   negative_revalidating_lifetime rather than the default 7 days of
   max_stale_age.)


Option 3: Remove negative_revalidating_lifetime
We can remove negative_revalidating_lifetime from the product. This would
involve:

   - Removing the documentation of negative_revalidating_lifetime
   <https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime>
   from our records.config documentation. We would also want to explain the
   role that max_stale_age plays in limiting the usage of stale resources in
   negative revalidating circumstances.
   - Removing the addition of the faulty Expires header from HttpTransact.cc
   <https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L4364>
   - Keep the parsing of negative_revalidating_lifetime, if it is used by a
   user, but add a Warning that it is defunct, deprecated, and will be removed
   in a future release.

Advantages:

   - This removes the confusion associated with having both a
   negative_revalidating_lifetime feature used for these negative revalidating
   resources and the max_stale_age feature.
   - It removes the bugs associated with negative_revalidating_lifetime
   because the feature is removed.

Disadvantages:

   - If someone would like a negative_revalidating_lifetime feature in the
   future, it will not exist and will need to be implemented.

Suggestion
I recommend Option 3: Remove negative_revalidating_lifetime. It's safe to
remove because the feature does not currently work and seemingly has never
worked; thus nothing is being taken away from ATS. On the other hand, the
buggy behavior and confusing documentation is alleviated by it being
removed. The downside to this option is that we won't have the added
negative_revalidating_lifetime feature in the event that someone wants it
in the future. But, again, that leaves us no worse off than we are today
since our current implementation doesn't effectively implement it.

Let me know if you have other thoughts or concerns.

Thanks,
Brian

-- 
"Come to Me, all who are weary and heavy-laden, and I will
give you rest. Take My yoke upon you and learn from Me, for
I am gentle and humble in heart, and you will find rest for
your souls. For My yoke is easy and My burden is light."

    ~ Matthew 11:28-30

Re: Design Review: How to handle negative_revalidating_lifetime

Posted by Sudheer Vinukonda <su...@yahoo.com.INVALID>.
Er, sorry that didn’t sound right. Agree with you that not updating max-age seems more like a bug if we are to go with the hypothesis.

> On Mar 26, 2021, at 3:02 PM, Sudheer Vinukonda <su...@yahoo.com> wrote:
> 
> 
> Thanks Brian! 
> 
> Hmm, it seems to me that the traces you shared actually further confirm the hypothesis - ie ATS serves a locally cached stale object because of negative revalidation (and Origin connection error/5xx), yet we do *not* want the clients receiving that copy (whether it's an actual browser/device or a CDN) to cache it. This makes sense because the problem is between ATS and its Origin, but, we want the clients to continue trying to refresh their copy of the object? 
> 
> 
> 
> On Friday, March 26, 2021, 01:01:15 PM PDT, Brian Neradt <br...@gmail.com> wrote:
> 
> 
> Thank you Sudheer. This is really helpful. That at least gives some
> justification for the current behavior: try to communicate the lifetime of
> the object to something in the future so the upstream clients or proxies
> don't think they need to revalidate their cached objects. It is
> insufficient, unfortunately, since the mechanism doesn't remove any max-age
> directives which would override the Expires header. Take for example these
> following responses, the first being what the origin Proxy Verifier replied
> with and with which the cache was populated:
> 
> [0]: Wrote 134 bytes in an HTTP/1 response to request with key 1 with
> response status 200:
> - "content-length": "32"
> - "cache-control": "max-age=2"
> - "date": "Thu, 25 Mar 2021 19:02:17 GMT"
> 
> Note the Date is for yesterday. It is thus immediately stale since the
> max-age is for 2 seconds.
> 
> This second is the response received by the client for this same object and
> was served out of ATS's cache. Proxy Verifier was configured to reply with
> a 503, thus ATS served this stale resource out of cache because of negative
> revalidating since max_stale_age is 7 days (the default value), so ATS
> thinks it's fresh enough:
> 
>   [0]: Received an HTTP/1 200 response for key 2 with headers:
> - "content-length": "32"
> - "cache-control": "max-age=2"
> - "date": "Thu, 25 Mar 2021 19:02:17 GMT"
> - "Expires": "Fri, 26 Mar 2021 19:44:34 GMT"
> - "Connection": "keep-alive"
> - "Server": "ATS/10.0.0"
> 
> You see the problem: the Expires is set to 30 minutes in the future (the
> default value of negative_revalidating_lifetime). But the Date: header is
> yesterday and the max-age directive is still present and says 2 seconds.
> Since the RFC specifies preference of max-age directives over the Expires
> header, upstream caches will still consider this stale. s-maxage would have
> this problem as well. An Age header field can also be problematic for
> different reasons.
> 
> I say this only to clarify in more detail the limitations of what is
> currently done with the Expires header. None of this invalidates your
> observations. I'm more comfortable with and prefer option 1 now: simply
> document the current behavior. We can at least explain its rationale while
> acknowledging briefly its limitations. It has been like this for years and
> no one has expressed problems it has caused. I'll put up a documentation PR.
> 
> Thanks again for your thoughts.
> 
> Brian
> 
> 
> On Thu, Mar 25, 2021 at 11:27 AM Sudheer Vinukonda
> <su...@yahoo.com.invalid> wrote:
> 
> >  Hi Brian,
> > Thank you for the detailed and excellent write up on the current behavior!
> > Just to play a bit of devil's advocate, I'm wondering whether the
> > `negative_revalidating_lifetime` is being interpreted incorrectly. I do
> > admit it's confusing given the name and the documentation (which can both
> > clearly use some improvement), but, one potential interpretation of the
> > current behavior could be:
> > Use `negative_revalidating_lifetime` as a delta extension to the object's
> > life following a request that resulted in a Origin error (connection or
> > 5xx). IOW, when configured, this timeout is used to not ping the Origin
> > again and assume the object is fresh in cache for that duration. Arguably,
> > this may be a way to minimize the load on the Origin that may be
> > potentially experiencing systemic issues.
> > The reason to still cap the overall object's life to
> > `proxy.config.http.cache.max_stale_age` also seems like a
> > reasonable/desirable behavior to me. This is a catchall upper bound on how
> > long one would consider the object in their cache is still valid beyond
> > which it is not usable anymore - meaning, at that point returning an error
> > to the client is better than serving that stale copy and causing other side
> > effects.
> > Tldr; I think if we adjusted the documentation to explain this somehow,
> > then the current behavior doesn't sound all that embarrassing to me.
> > PS: Just to be clear, I'm not saying we should not change the current
> > behavior in favor of a better alternative, but only trying to
> > speculate/explain the possible reasoning behind the behavior.
> > Thanks.
> > Sudheer
> >    On Wednesday, March 24, 2021, 02:13:42 PM PDT, Brian Neradt <
> > brian.neradt@gmail.com> wrote:
> >
> >  dev@trafficserver.apache.org:
> >
> > Context
> > This design review concerns
> > proxy.config.http.negative_revalidating_lifetime
> > <
> > https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime
> > >
> > and
> > is the result of investigating the feature for the following issue:
> >
> > https://github.com/apache/trafficserver/issues/7425
> >
> > Feature Overview
> > The ATS negative revalidation feature implements the option to serve stale
> > cached responses in the circumstance that the server is not reachable for
> > revalidation. This can happen due to connectivity issues against the
> > origin or the origin's service being down resulting in 5xx responses.
> >
> > As a clarification, do not conflate negative revalidation with the negative
> > caching feature. While they sound alike, the two features are orthogonal to
> > each other. The ATS negative caching feature allows for caching of negative
> > HTTP responses (such as 404 "Not Found"). By contrast, as summarized above,
> > negative revalidation allows for serving some cached response that is stale
> > when the server is not reachable for revalidation.
> >
> > negative_revalidating_lifetime issues
> > That's the negative revalidation feature: serve stale content when the
> > server is not accessible. This generally works as expected. The problem
> > lies in determining whether the stale resource is fresh enough via the
> > negative_revalidating_lifetime configuration for this feature. The
> > intention of this lifetime configuration was to provide a way for the user
> > to specify how old a stale resource can be when negative revalidation takes
> > place before it will no longer be served from the cache. Looking at the
> > documentation and the code for it, the *expected behavior* seemed to have
> > been the following:
> >
> >
> >  1. A client requests an item. The server responds with, say, a 200 OK
> >  and the ATS proxy caches the response.
> >  2. At some point in the future, a client requests this item again.
> >  3. ATS inspects the item and determines that it is stale.
> >  4. ATS attempts to validate its cached resource against the origin.
> >  5. Either a connection cannot be made to the origin or it receives a 5xx
> >  in response to its validation request.
> >  6. If negative revalidation is enabled, ATS compares the age of the
> >  stale resource against negative_revalidating_lifetime. It will then do
> > one
> >  of two things:
> >      1. If the age of the cached resource is younger than
> >      negative_revalidating_lifetime, consider it "fresh enough" and
> > reply to the
> >      origin with the cached response.
> >      2. If the age is older than negative_revalidating_lifetime, reply
> >      with a 5xx response.
> >
> > In reality step 6 does not happen. The stale resource is never re-evaluated
> > as fresh or still stale against negative_revalidating_lifetime. Rather, the
> > determining factor of whether the resource is served is
> > proxy.config.http.cache.max_stale_age
> > <
> > https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-cache-max-stale-age
> > >.
> > That is, if its age is less than max_stale_age, it is used; otherwise it is
> > not. In addition, an incorrectly calculated Expires header field is added
> > which is set to negative_revalidating_lifetime seconds from the current
> > time rather than negative_revalidating_lifetime seconds from the resources
> > calculated generation time.
> >
> > These negative_revalidating_lifetime issues may very well be day zero bugs.
> > Looking at git, it goes back to at least 2009 in a commit
> > <
> > https://github.com/apache/trafficserver/commit/a1651345d11f228dfb392cc79169760e80d25f06
> > >
> > titled "Initial commit" (presumably when we transitioned source code
> > repositories). It has not worked for at least a very long time and people
> > have been relying, knowingly or not, upon max_stale_age to limit how stale
> > resources can be before they are used for negative revalidation.
> >
> > This design review evaluates the options for how to handle
> > negative_revalidating_lifetime.
> >
> > Options
> > Option 1: Document the Current Behavior
> > The first option is to leave the feature as is and document how
> > negative_revalidating_lifetime currently behaves.
> >
> > Advantages:
> >
> >  - This requires the least amount of work as it only involves a
> >  documentation change.
> >  - This will effectively avoid user confusion about
> >  negative_revalidating_lifetime so that no further issues will be filed
> > and
> >  time wasted when it does not behave as expected.
> >
> > Disadvantages
> >
> >  - negative_revalidating_lifetime, as it currently behaves, would be hard
> >  to explain without embarrassing the project. We would have to say
> > something
> >  like:
> >
> > The value of negative_revalidating_lifetime is used to calculate an Expires
> > > header with a field value of negative_revalidating_lifetime seconds in
> > the
> > > future from the time the cached resource is served. This header is added
> > in
> > > the cached response. No other behavior is influenced by this
> > configuration.
> > > The age limit enforced for such stale resources is configured
> > > via proxy.config.http.cache.max_stale_age.
> >
> >
> > This likely raises more questions than it answers. In reality, it would
> > probably be most helpful to steer users away from this feature and toward
> > max_stale_age.
> >
> >
> > Option 2: Fix the Issues
> > Alternatively, we can fix the issues surrounding
> > negative_revalidating_lifetime. Looking at the code, it seems that the
> > Expires header, which is currently incorrectly calculated, really isn't
> > needed. Rather we can make the negative_revalidating_lifetime conditionally
> > a part of the HttpTransact::is_stale_cache_response_returnable()
> > <
> > https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L5943
> > >
> > function.
> >
> > In addition to max_stale_age, which is currently taken into account there,
> > we can also compare against negative_revalidating_lifetime to see whether
> > the resource can be served (but only when negative revalidation is being
> > used...the function is used in other contexts).
> >
> > Advantages:
> >
> >  - This addresses both the known issues with
> >  negative_revalidating_lifetime in that the Expires header is no longer
> >  added and the negative_revalidating_lifetime is effectively used in
> >  determining whether the item is fresh enough for the response.
> >  - The fix doesn't look all that hard to implement.
> >
> > Disadvantages:
> >
> >  - Implementing the feature adds various complexities to ATS. It
> >  complicates the documentation, it complicates the code, and will require
> >  tests to verify it works correctly. None of this complexity is large if
> > the
> >  feature is worth it, and I'm happy to do it. But I have my doubts that it
> >  is worth it. Keep in mind that negative_revalidating_lifetime has never
> >  been a working feature in ATS and this has caused problems for no one. If
> >  its absence had caused problems, we would know about it because people
> >  would complain. In fact, the only reported issue with the feature is the
> >  confusion it caused by existing in the document in the first place. Also,
> >  keep in mind that this feature would only be useful, if it is at all, to
> >  *extend* the lifetime of these cached resources past max_stale_age. Both
> >  of those values would be used and we would want to use the max of each.
> >  Currently max_stale_age defaults to 7 days. How many users will want
> >  max_stale_age to be one value for general max-age computations, but use a
> >  different, larger, value for negative revalidation? I'm guessing not
> > many.
> >  (Alternatively, I suppose we could use only
> > negative_revalidating_lifetime
> >  instead of using both it and max_stale_age in these situations, but that
> >  would potentially be a breaking change for current configurations in
> > which
> >  negative_revalidation would now only happen for the default 30 minutes of
> >  negative_revalidating_lifetime rather than the default 7 days of
> >  max_stale_age.)
> >
> >
> > Option 3: Remove negative_revalidating_lifetime
> > We can remove negative_revalidating_lifetime from the product. This would
> > involve:
> >
> >  - Removing the documentation of negative_revalidating_lifetime
> >  <
> > https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime
> > >
> >  from our records.config documentation. We would also want to explain the
> >  role that max_stale_age plays in limiting the usage of stale resources in
> >  negative revalidating circumstances.
> >  - Removing the addition of the faulty Expires header from HttpTransact.cc
> >  <
> > https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L4364
> > >
> >  - Keep the parsing of negative_revalidating_lifetime, if it is used by a
> >  user, but add a Warning that it is defunct, deprecated, and will be
> > removed
> >  in a future release.
> >
> > Advantages:
> >
> >  - This removes the confusion associated with having both a
> >  negative_revalidating_lifetime feature used for these negative
> > revalidating
> >  resources and the max_stale_age feature.
> >  - It removes the bugs associated with negative_revalidating_lifetime
> >  because the feature is removed.
> >
> > Disadvantages:
> >
> >  - If someone would like a negative_revalidating_lifetime feature in the
> >  future, it will not exist and will need to be implemented.
> >
> > Suggestion
> > I recommend Option 3: Remove negative_revalidating_lifetime. It's safe to
> > remove because the feature does not currently work and seemingly has never
> > worked; thus nothing is being taken away from ATS. On the other hand, the
> > buggy behavior and confusing documentation is alleviated by it being
> > removed. The downside to this option is that we won't have the added
> > negative_revalidating_lifetime feature in the event that someone wants it
> > in the future. But, again, that leaves us no worse off than we are today
> > since our current implementation doesn't effectively implement it.
> >
> > Let me know if you have other thoughts or concerns.
> >
> > Thanks,
> > Brian
> >
> > --
> > "Come to Me, all who are weary and heavy-laden, and I will
> > give you rest. Take My yoke upon you and learn from Me, for
> > I am gentle and humble in heart, and you will find rest for
> > your souls. For My yoke is easy and My burden is light."
> >
> >    ~ Matthew 11:28-30
> 
> >
> 
> 
> 
> -- 
> "Come to Me, all who are weary and heavy-laden, and I will
> give you rest. Take My yoke upon you and learn from Me, for
> I am gentle and humble in heart, and you will find rest for
> your souls. For My yoke is easy and My burden is light."
> 
>     ~ Matthew 11:28-30

Re: Design Review: How to handle negative_revalidating_lifetime

Posted by Sudheer Vinukonda <su...@yahoo.com.INVALID>.
 Thanks Brian! 
Hmm, it seems to me that the traces you shared actually further confirm the hypothesis - ie ATS serves a locally cached stale object because of negative revalidation (and Origin connection error/5xx), yet we do *not* want the clients receiving that copy (whether it's an actual browser/device or a CDN) to cache it. This makes sense because the problem is between ATS and its Origin, but, we want the clients to continue trying to refresh their copy of the object? 


    On Friday, March 26, 2021, 01:01:15 PM PDT, Brian Neradt <br...@gmail.com> wrote:  
 
 Thank you Sudheer. This is really helpful. That at least gives some
justification for the current behavior: try to communicate the lifetime of
the object to something in the future so the upstream clients or proxies
don't think they need to revalidate their cached objects. It is
insufficient, unfortunately, since the mechanism doesn't remove any max-age
directives which would override the Expires header. Take for example these
following responses, the first being what the origin Proxy Verifier replied
with and with which the cache was populated:

 [0]: Wrote 134 bytes in an HTTP/1 response to request with key 1 with
response status 200:
- "content-length": "32"
- "cache-control": "max-age=2"
- "date": "Thu, 25 Mar 2021 19:02:17 GMT"

Note the Date is for yesterday. It is thus immediately stale since the
max-age is for 2 seconds.

This second is the response received by the client for this same object and
was served out of ATS's cache. Proxy Verifier was configured to reply with
a 503, thus ATS served this stale resource out of cache because of negative
revalidating since max_stale_age is 7 days (the default value), so ATS
thinks it's fresh enough:

  [0]: Received an HTTP/1 200 response for key 2 with headers:
- "content-length": "32"
- "cache-control": "max-age=2"
- "date": "Thu, 25 Mar 2021 19:02:17 GMT"
- "Expires": "Fri, 26 Mar 2021 19:44:34 GMT"
- "Connection": "keep-alive"
- "Server": "ATS/10.0.0"

You see the problem: the Expires is set to 30 minutes in the future (the
default value of negative_revalidating_lifetime). But the Date: header is
yesterday and the max-age directive is still present and says 2 seconds.
Since the RFC specifies preference of max-age directives over the Expires
header, upstream caches will still consider this stale. s-maxage would have
this problem as well. An Age header field can also be problematic for
different reasons.

I say this only to clarify in more detail the limitations of what is
currently done with the Expires header. None of this invalidates your
observations. I'm more comfortable with and prefer option 1 now: simply
document the current behavior. We can at least explain its rationale while
acknowledging briefly its limitations. It has been like this for years and
no one has expressed problems it has caused. I'll put up a documentation PR.

Thanks again for your thoughts.

Brian


On Thu, Mar 25, 2021 at 11:27 AM Sudheer Vinukonda
<su...@yahoo.com.invalid> wrote:

>  Hi Brian,
> Thank you for the detailed and excellent write up on the current behavior!
> Just to play a bit of devil's advocate, I'm wondering whether the
> `negative_revalidating_lifetime` is being interpreted incorrectly. I do
> admit it's confusing given the name and the documentation (which can both
> clearly use some improvement), but, one potential interpretation of the
> current behavior could be:
> Use `negative_revalidating_lifetime` as a delta extension to the object's
> life following a request that resulted in a Origin error (connection or
> 5xx). IOW, when configured, this timeout is used to not ping the Origin
> again and assume the object is fresh in cache for that duration. Arguably,
> this may be a way to minimize the load on the Origin that may be
> potentially experiencing systemic issues.
> The reason to still cap the overall object's life to
> `proxy.config.http.cache.max_stale_age` also seems like a
> reasonable/desirable behavior to me. This is a catchall upper bound on how
> long one would consider the object in their cache is still valid beyond
> which it is not usable anymore - meaning, at that point returning an error
> to the client is better than serving that stale copy and causing other side
> effects.
> Tldr; I think if we adjusted the documentation to explain this somehow,
> then the current behavior doesn't sound all that embarrassing to me.
> PS: Just to be clear, I'm not saying we should not change the current
> behavior in favor of a better alternative, but only trying to
> speculate/explain the possible reasoning behind the behavior.
> Thanks.
> Sudheer
>    On Wednesday, March 24, 2021, 02:13:42 PM PDT, Brian Neradt <
> brian.neradt@gmail.com> wrote:
>
>  dev@trafficserver.apache.org:
>
> Context
> This design review concerns
> proxy.config.http.negative_revalidating_lifetime
> <
> https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime
> >
> and
> is the result of investigating the feature for the following issue:
>
> https://github.com/apache/trafficserver/issues/7425
>
> Feature Overview
> The ATS negative revalidation feature implements the option to serve stale
> cached responses in the circumstance that the server is not reachable for
> revalidation. This can happen due to connectivity issues against the
> origin or the origin's service being down resulting in 5xx responses.
>
> As a clarification, do not conflate negative revalidation with the negative
> caching feature. While they sound alike, the two features are orthogonal to
> each other. The ATS negative caching feature allows for caching of negative
> HTTP responses (such as 404 "Not Found"). By contrast, as summarized above,
> negative revalidation allows for serving some cached response that is stale
> when the server is not reachable for revalidation.
>
> negative_revalidating_lifetime issues
> That's the negative revalidation feature: serve stale content when the
> server is not accessible. This generally works as expected. The problem
> lies in determining whether the stale resource is fresh enough via the
> negative_revalidating_lifetime configuration for this feature. The
> intention of this lifetime configuration was to provide a way for the user
> to specify how old a stale resource can be when negative revalidation takes
> place before it will no longer be served from the cache. Looking at the
> documentation and the code for it, the *expected behavior* seemed to have
> been the following:
>
>
>  1. A client requests an item. The server responds with, say, a 200 OK
>  and the ATS proxy caches the response.
>  2. At some point in the future, a client requests this item again.
>  3. ATS inspects the item and determines that it is stale.
>  4. ATS attempts to validate its cached resource against the origin.
>  5. Either a connection cannot be made to the origin or it receives a 5xx
>  in response to its validation request.
>  6. If negative revalidation is enabled, ATS compares the age of the
>  stale resource against negative_revalidating_lifetime. It will then do
> one
>  of two things:
>      1. If the age of the cached resource is younger than
>      negative_revalidating_lifetime, consider it "fresh enough" and
> reply to the
>      origin with the cached response.
>      2. If the age is older than negative_revalidating_lifetime, reply
>      with a 5xx response.
>
> In reality step 6 does not happen. The stale resource is never re-evaluated
> as fresh or still stale against negative_revalidating_lifetime. Rather, the
> determining factor of whether the resource is served is
> proxy.config.http.cache.max_stale_age
> <
> https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-cache-max-stale-age
> >.
> That is, if its age is less than max_stale_age, it is used; otherwise it is
> not. In addition, an incorrectly calculated Expires header field is added
> which is set to negative_revalidating_lifetime seconds from the current
> time rather than negative_revalidating_lifetime seconds from the resources
> calculated generation time.
>
> These negative_revalidating_lifetime issues may very well be day zero bugs.
> Looking at git, it goes back to at least 2009 in a commit
> <
> https://github.com/apache/trafficserver/commit/a1651345d11f228dfb392cc79169760e80d25f06
> >
> titled "Initial commit" (presumably when we transitioned source code
> repositories). It has not worked for at least a very long time and people
> have been relying, knowingly or not, upon max_stale_age to limit how stale
> resources can be before they are used for negative revalidation.
>
> This design review evaluates the options for how to handle
> negative_revalidating_lifetime.
>
> Options
> Option 1: Document the Current Behavior
> The first option is to leave the feature as is and document how
> negative_revalidating_lifetime currently behaves.
>
> Advantages:
>
>  - This requires the least amount of work as it only involves a
>  documentation change.
>  - This will effectively avoid user confusion about
>  negative_revalidating_lifetime so that no further issues will be filed
> and
>  time wasted when it does not behave as expected.
>
> Disadvantages
>
>  - negative_revalidating_lifetime, as it currently behaves, would be hard
>  to explain without embarrassing the project. We would have to say
> something
>  like:
>
> The value of negative_revalidating_lifetime is used to calculate an Expires
> > header with a field value of negative_revalidating_lifetime seconds in
> the
> > future from the time the cached resource is served. This header is added
> in
> > the cached response. No other behavior is influenced by this
> configuration.
> > The age limit enforced for such stale resources is configured
> > via proxy.config.http.cache.max_stale_age.
>
>
> This likely raises more questions than it answers. In reality, it would
> probably be most helpful to steer users away from this feature and toward
> max_stale_age.
>
>
> Option 2: Fix the Issues
> Alternatively, we can fix the issues surrounding
> negative_revalidating_lifetime. Looking at the code, it seems that the
> Expires header, which is currently incorrectly calculated, really isn't
> needed. Rather we can make the negative_revalidating_lifetime conditionally
> a part of the HttpTransact::is_stale_cache_response_returnable()
> <
> https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L5943
> >
> function.
>
> In addition to max_stale_age, which is currently taken into account there,
> we can also compare against negative_revalidating_lifetime to see whether
> the resource can be served (but only when negative revalidation is being
> used...the function is used in other contexts).
>
> Advantages:
>
>  - This addresses both the known issues with
>  negative_revalidating_lifetime in that the Expires header is no longer
>  added and the negative_revalidating_lifetime is effectively used in
>  determining whether the item is fresh enough for the response.
>  - The fix doesn't look all that hard to implement.
>
> Disadvantages:
>
>  - Implementing the feature adds various complexities to ATS. It
>  complicates the documentation, it complicates the code, and will require
>  tests to verify it works correctly. None of this complexity is large if
> the
>  feature is worth it, and I'm happy to do it. But I have my doubts that it
>  is worth it. Keep in mind that negative_revalidating_lifetime has never
>  been a working feature in ATS and this has caused problems for no one. If
>  its absence had caused problems, we would know about it because people
>  would complain. In fact, the only reported issue with the feature is the
>  confusion it caused by existing in the document in the first place. Also,
>  keep in mind that this feature would only be useful, if it is at all, to
>  *extend* the lifetime of these cached resources past max_stale_age. Both
>  of those values would be used and we would want to use the max of each.
>  Currently max_stale_age defaults to 7 days. How many users will want
>  max_stale_age to be one value for general max-age computations, but use a
>  different, larger, value for negative revalidation? I'm guessing not
> many.
>  (Alternatively, I suppose we could use only
> negative_revalidating_lifetime
>  instead of using both it and max_stale_age in these situations, but that
>  would potentially be a breaking change for current configurations in
> which
>  negative_revalidation would now only happen for the default 30 minutes of
>  negative_revalidating_lifetime rather than the default 7 days of
>  max_stale_age.)
>
>
> Option 3: Remove negative_revalidating_lifetime
> We can remove negative_revalidating_lifetime from the product. This would
> involve:
>
>  - Removing the documentation of negative_revalidating_lifetime
>  <
> https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime
> >
>  from our records.config documentation. We would also want to explain the
>  role that max_stale_age plays in limiting the usage of stale resources in
>  negative revalidating circumstances.
>  - Removing the addition of the faulty Expires header from HttpTransact.cc
>  <
> https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L4364
> >
>  - Keep the parsing of negative_revalidating_lifetime, if it is used by a
>  user, but add a Warning that it is defunct, deprecated, and will be
> removed
>  in a future release.
>
> Advantages:
>
>  - This removes the confusion associated with having both a
>  negative_revalidating_lifetime feature used for these negative
> revalidating
>  resources and the max_stale_age feature.
>  - It removes the bugs associated with negative_revalidating_lifetime
>  because the feature is removed.
>
> Disadvantages:
>
>  - If someone would like a negative_revalidating_lifetime feature in the
>  future, it will not exist and will need to be implemented.
>
> Suggestion
> I recommend Option 3: Remove negative_revalidating_lifetime. It's safe to
> remove because the feature does not currently work and seemingly has never
> worked; thus nothing is being taken away from ATS. On the other hand, the
> buggy behavior and confusing documentation is alleviated by it being
> removed. The downside to this option is that we won't have the added
> negative_revalidating_lifetime feature in the event that someone wants it
> in the future. But, again, that leaves us no worse off than we are today
> since our current implementation doesn't effectively implement it.
>
> Let me know if you have other thoughts or concerns.
>
> Thanks,
> Brian
>
> --
> "Come to Me, all who are weary and heavy-laden, and I will
> give you rest. Take My yoke upon you and learn from Me, for
> I am gentle and humble in heart, and you will find rest for
> your souls. For My yoke is easy and My burden is light."
>
>    ~ Matthew 11:28-30
>



-- 
"Come to Me, all who are weary and heavy-laden, and I will
give you rest. Take My yoke upon you and learn from Me, for
I am gentle and humble in heart, and you will find rest for
your souls. For My yoke is easy and My burden is light."

    ~ Matthew 11:28-30
  

Re: Design Review: How to handle negative_revalidating_lifetime

Posted by Brian Neradt <br...@gmail.com>.
Thank you Sudheer. This is really helpful. That at least gives some
justification for the current behavior: try to communicate the lifetime of
the object to something in the future so the upstream clients or proxies
don't think they need to revalidate their cached objects. It is
insufficient, unfortunately, since the mechanism doesn't remove any max-age
directives which would override the Expires header. Take for example these
following responses, the first being what the origin Proxy Verifier replied
with and with which the cache was populated:

 [0]: Wrote 134 bytes in an HTTP/1 response to request with key 1 with
response status 200:
- "content-length": "32"
- "cache-control": "max-age=2"
- "date": "Thu, 25 Mar 2021 19:02:17 GMT"

Note the Date is for yesterday. It is thus immediately stale since the
max-age is for 2 seconds.

This second is the response received by the client for this same object and
was served out of ATS's cache. Proxy Verifier was configured to reply with
a 503, thus ATS served this stale resource out of cache because of negative
revalidating since max_stale_age is 7 days (the default value), so ATS
thinks it's fresh enough:

   [0]: Received an HTTP/1 200 response for key 2 with headers:
- "content-length": "32"
- "cache-control": "max-age=2"
- "date": "Thu, 25 Mar 2021 19:02:17 GMT"
- "Expires": "Fri, 26 Mar 2021 19:44:34 GMT"
- "Connection": "keep-alive"
- "Server": "ATS/10.0.0"

You see the problem: the Expires is set to 30 minutes in the future (the
default value of negative_revalidating_lifetime). But the Date: header is
yesterday and the max-age directive is still present and says 2 seconds.
Since the RFC specifies preference of max-age directives over the Expires
header, upstream caches will still consider this stale. s-maxage would have
this problem as well. An Age header field can also be problematic for
different reasons.

I say this only to clarify in more detail the limitations of what is
currently done with the Expires header. None of this invalidates your
observations. I'm more comfortable with and prefer option 1 now: simply
document the current behavior. We can at least explain its rationale while
acknowledging briefly its limitations. It has been like this for years and
no one has expressed problems it has caused. I'll put up a documentation PR.

Thanks again for your thoughts.

Brian


On Thu, Mar 25, 2021 at 11:27 AM Sudheer Vinukonda
<su...@yahoo.com.invalid> wrote:

>  Hi Brian,
> Thank you for the detailed and excellent write up on the current behavior!
> Just to play a bit of devil's advocate, I'm wondering whether the
> `negative_revalidating_lifetime` is being interpreted incorrectly. I do
> admit it's confusing given the name and the documentation (which can both
> clearly use some improvement), but, one potential interpretation of the
> current behavior could be:
> Use `negative_revalidating_lifetime` as a delta extension to the object's
> life following a request that resulted in a Origin error (connection or
> 5xx). IOW, when configured, this timeout is used to not ping the Origin
> again and assume the object is fresh in cache for that duration. Arguably,
> this may be a way to minimize the load on the Origin that may be
> potentially experiencing systemic issues.
> The reason to still cap the overall object's life to
> `proxy.config.http.cache.max_stale_age` also seems like a
> reasonable/desirable behavior to me. This is a catchall upper bound on how
> long one would consider the object in their cache is still valid beyond
> which it is not usable anymore - meaning, at that point returning an error
> to the client is better than serving that stale copy and causing other side
> effects.
> Tldr; I think if we adjusted the documentation to explain this somehow,
> then the current behavior doesn't sound all that embarrassing to me.
> PS: Just to be clear, I'm not saying we should not change the current
> behavior in favor of a better alternative, but only trying to
> speculate/explain the possible reasoning behind the behavior.
> Thanks.
> Sudheer
>     On Wednesday, March 24, 2021, 02:13:42 PM PDT, Brian Neradt <
> brian.neradt@gmail.com> wrote:
>
>  dev@trafficserver.apache.org:
>
> Context
> This design review concerns
> proxy.config.http.negative_revalidating_lifetime
> <
> https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime
> >
> and
> is the result of investigating the feature for the following issue:
>
> https://github.com/apache/trafficserver/issues/7425
>
> Feature Overview
> The ATS negative revalidation feature implements the option to serve stale
> cached responses in the circumstance that the server is not reachable for
> revalidation. This can happen due to connectivity issues against the
> origin or the origin's service being down resulting in 5xx responses.
>
> As a clarification, do not conflate negative revalidation with the negative
> caching feature. While they sound alike, the two features are orthogonal to
> each other. The ATS negative caching feature allows for caching of negative
> HTTP responses (such as 404 "Not Found"). By contrast, as summarized above,
> negative revalidation allows for serving some cached response that is stale
> when the server is not reachable for revalidation.
>
> negative_revalidating_lifetime issues
> That's the negative revalidation feature: serve stale content when the
> server is not accessible. This generally works as expected. The problem
> lies in determining whether the stale resource is fresh enough via the
> negative_revalidating_lifetime configuration for this feature. The
> intention of this lifetime configuration was to provide a way for the user
> to specify how old a stale resource can be when negative revalidation takes
> place before it will no longer be served from the cache. Looking at the
> documentation and the code for it, the *expected behavior* seemed to have
> been the following:
>
>
>   1. A client requests an item. The server responds with, say, a 200 OK
>   and the ATS proxy caches the response.
>   2. At some point in the future, a client requests this item again.
>   3. ATS inspects the item and determines that it is stale.
>   4. ATS attempts to validate its cached resource against the origin.
>   5. Either a connection cannot be made to the origin or it receives a 5xx
>   in response to its validation request.
>   6. If negative revalidation is enabled, ATS compares the age of the
>   stale resource against negative_revalidating_lifetime. It will then do
> one
>   of two things:
>       1. If the age of the cached resource is younger than
>       negative_revalidating_lifetime, consider it "fresh enough" and
> reply to the
>       origin with the cached response.
>       2. If the age is older than negative_revalidating_lifetime, reply
>       with a 5xx response.
>
> In reality step 6 does not happen. The stale resource is never re-evaluated
> as fresh or still stale against negative_revalidating_lifetime. Rather, the
> determining factor of whether the resource is served is
> proxy.config.http.cache.max_stale_age
> <
> https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-cache-max-stale-age
> >.
> That is, if its age is less than max_stale_age, it is used; otherwise it is
> not. In addition, an incorrectly calculated Expires header field is added
> which is set to negative_revalidating_lifetime seconds from the current
> time rather than negative_revalidating_lifetime seconds from the resources
> calculated generation time.
>
> These negative_revalidating_lifetime issues may very well be day zero bugs.
> Looking at git, it goes back to at least 2009 in a commit
> <
> https://github.com/apache/trafficserver/commit/a1651345d11f228dfb392cc79169760e80d25f06
> >
> titled "Initial commit" (presumably when we transitioned source code
> repositories). It has not worked for at least a very long time and people
> have been relying, knowingly or not, upon max_stale_age to limit how stale
> resources can be before they are used for negative revalidation.
>
> This design review evaluates the options for how to handle
> negative_revalidating_lifetime.
>
> Options
> Option 1: Document the Current Behavior
> The first option is to leave the feature as is and document how
> negative_revalidating_lifetime currently behaves.
>
> Advantages:
>
>   - This requires the least amount of work as it only involves a
>   documentation change.
>   - This will effectively avoid user confusion about
>   negative_revalidating_lifetime so that no further issues will be filed
> and
>   time wasted when it does not behave as expected.
>
> Disadvantages
>
>   - negative_revalidating_lifetime, as it currently behaves, would be hard
>   to explain without embarrassing the project. We would have to say
> something
>   like:
>
> The value of negative_revalidating_lifetime is used to calculate an Expires
> > header with a field value of negative_revalidating_lifetime seconds in
> the
> > future from the time the cached resource is served. This header is added
> in
> > the cached response. No other behavior is influenced by this
> configuration.
> > The age limit enforced for such stale resources is configured
> > via proxy.config.http.cache.max_stale_age.
>
>
> This likely raises more questions than it answers. In reality, it would
> probably be most helpful to steer users away from this feature and toward
> max_stale_age.
>
>
> Option 2: Fix the Issues
> Alternatively, we can fix the issues surrounding
> negative_revalidating_lifetime. Looking at the code, it seems that the
> Expires header, which is currently incorrectly calculated, really isn't
> needed. Rather we can make the negative_revalidating_lifetime conditionally
> a part of the HttpTransact::is_stale_cache_response_returnable()
> <
> https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L5943
> >
> function.
>
> In addition to max_stale_age, which is currently taken into account there,
> we can also compare against negative_revalidating_lifetime to see whether
> the resource can be served (but only when negative revalidation is being
> used...the function is used in other contexts).
>
> Advantages:
>
>   - This addresses both the known issues with
>   negative_revalidating_lifetime in that the Expires header is no longer
>   added and the negative_revalidating_lifetime is effectively used in
>   determining whether the item is fresh enough for the response.
>   - The fix doesn't look all that hard to implement.
>
> Disadvantages:
>
>   - Implementing the feature adds various complexities to ATS. It
>   complicates the documentation, it complicates the code, and will require
>   tests to verify it works correctly. None of this complexity is large if
> the
>   feature is worth it, and I'm happy to do it. But I have my doubts that it
>   is worth it. Keep in mind that negative_revalidating_lifetime has never
>   been a working feature in ATS and this has caused problems for no one. If
>   its absence had caused problems, we would know about it because people
>   would complain. In fact, the only reported issue with the feature is the
>   confusion it caused by existing in the document in the first place. Also,
>   keep in mind that this feature would only be useful, if it is at all, to
>   *extend* the lifetime of these cached resources past max_stale_age. Both
>   of those values would be used and we would want to use the max of each.
>   Currently max_stale_age defaults to 7 days. How many users will want
>   max_stale_age to be one value for general max-age computations, but use a
>   different, larger, value for negative revalidation? I'm guessing not
> many.
>   (Alternatively, I suppose we could use only
> negative_revalidating_lifetime
>   instead of using both it and max_stale_age in these situations, but that
>   would potentially be a breaking change for current configurations in
> which
>   negative_revalidation would now only happen for the default 30 minutes of
>   negative_revalidating_lifetime rather than the default 7 days of
>   max_stale_age.)
>
>
> Option 3: Remove negative_revalidating_lifetime
> We can remove negative_revalidating_lifetime from the product. This would
> involve:
>
>   - Removing the documentation of negative_revalidating_lifetime
>   <
> https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime
> >
>   from our records.config documentation. We would also want to explain the
>   role that max_stale_age plays in limiting the usage of stale resources in
>   negative revalidating circumstances.
>   - Removing the addition of the faulty Expires header from HttpTransact.cc
>   <
> https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L4364
> >
>   - Keep the parsing of negative_revalidating_lifetime, if it is used by a
>   user, but add a Warning that it is defunct, deprecated, and will be
> removed
>   in a future release.
>
> Advantages:
>
>   - This removes the confusion associated with having both a
>   negative_revalidating_lifetime feature used for these negative
> revalidating
>   resources and the max_stale_age feature.
>   - It removes the bugs associated with negative_revalidating_lifetime
>   because the feature is removed.
>
> Disadvantages:
>
>   - If someone would like a negative_revalidating_lifetime feature in the
>   future, it will not exist and will need to be implemented.
>
> Suggestion
> I recommend Option 3: Remove negative_revalidating_lifetime. It's safe to
> remove because the feature does not currently work and seemingly has never
> worked; thus nothing is being taken away from ATS. On the other hand, the
> buggy behavior and confusing documentation is alleviated by it being
> removed. The downside to this option is that we won't have the added
> negative_revalidating_lifetime feature in the event that someone wants it
> in the future. But, again, that leaves us no worse off than we are today
> since our current implementation doesn't effectively implement it.
>
> Let me know if you have other thoughts or concerns.
>
> Thanks,
> Brian
>
> --
> "Come to Me, all who are weary and heavy-laden, and I will
> give you rest. Take My yoke upon you and learn from Me, for
> I am gentle and humble in heart, and you will find rest for
> your souls. For My yoke is easy and My burden is light."
>
>     ~ Matthew 11:28-30
>



-- 
"Come to Me, all who are weary and heavy-laden, and I will
give you rest. Take My yoke upon you and learn from Me, for
I am gentle and humble in heart, and you will find rest for
your souls. For My yoke is easy and My burden is light."

    ~ Matthew 11:28-30

Re: Design Review: How to handle negative_revalidating_lifetime

Posted by Sudheer Vinukonda <su...@yahoo.com.INVALID>.
 Hi Brian,
Thank you for the detailed and excellent write up on the current behavior! 
Just to play a bit of devil's advocate, I'm wondering whether the `negative_revalidating_lifetime` is being interpreted incorrectly. I do admit it's confusing given the name and the documentation (which can both clearly use some improvement), but, one potential interpretation of the current behavior could be:
Use `negative_revalidating_lifetime` as a delta extension to the object's life following a request that resulted in a Origin error (connection or 5xx). IOW, when configured, this timeout is used to not ping the Origin again and assume the object is fresh in cache for that duration. Arguably, this may be a way to minimize the load on the Origin that may be potentially experiencing systemic issues. 
The reason to still cap the overall object's life to `proxy.config.http.cache.max_stale_age` also seems like a reasonable/desirable behavior to me. This is a catchall upper bound on how long one would consider the object in their cache is still valid beyond which it is not usable anymore - meaning, at that point returning an error to the client is better than serving that stale copy and causing other side effects.
Tldr; I think if we adjusted the documentation to explain this somehow, then the current behavior doesn't sound all that embarrassing to me.
PS: Just to be clear, I'm not saying we should not change the current behavior in favor of a better alternative, but only trying to speculate/explain the possible reasoning behind the behavior.
Thanks.
Sudheer
    On Wednesday, March 24, 2021, 02:13:42 PM PDT, Brian Neradt <br...@gmail.com> wrote:  
 
 dev@trafficserver.apache.org:

Context
This design review concerns proxy.config.http.negative_revalidating_lifetime
<https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime>
and
is the result of investigating the feature for the following issue:

https://github.com/apache/trafficserver/issues/7425

Feature Overview
The ATS negative revalidation feature implements the option to serve stale
cached responses in the circumstance that the server is not reachable for
revalidation. This can happen due to connectivity issues against the
origin or the origin's service being down resulting in 5xx responses.

As a clarification, do not conflate negative revalidation with the negative
caching feature. While they sound alike, the two features are orthogonal to
each other. The ATS negative caching feature allows for caching of negative
HTTP responses (such as 404 "Not Found"). By contrast, as summarized above,
negative revalidation allows for serving some cached response that is stale
when the server is not reachable for revalidation.

negative_revalidating_lifetime issues
That's the negative revalidation feature: serve stale content when the
server is not accessible. This generally works as expected. The problem
lies in determining whether the stale resource is fresh enough via the
negative_revalidating_lifetime configuration for this feature. The
intention of this lifetime configuration was to provide a way for the user
to specify how old a stale resource can be when negative revalidation takes
place before it will no longer be served from the cache. Looking at the
documentation and the code for it, the *expected behavior* seemed to have
been the following:


  1. A client requests an item. The server responds with, say, a 200 OK
  and the ATS proxy caches the response.
  2. At some point in the future, a client requests this item again.
  3. ATS inspects the item and determines that it is stale.
  4. ATS attempts to validate its cached resource against the origin.
  5. Either a connection cannot be made to the origin or it receives a 5xx
  in response to its validation request.
  6. If negative revalidation is enabled, ATS compares the age of the
  stale resource against negative_revalidating_lifetime. It will then do one
  of two things:
      1. If the age of the cached resource is younger than
      negative_revalidating_lifetime, consider it "fresh enough" and
reply to the
      origin with the cached response.
      2. If the age is older than negative_revalidating_lifetime, reply
      with a 5xx response.

In reality step 6 does not happen. The stale resource is never re-evaluated
as fresh or still stale against negative_revalidating_lifetime. Rather, the
determining factor of whether the resource is served is
proxy.config.http.cache.max_stale_age
<https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-cache-max-stale-age>.
That is, if its age is less than max_stale_age, it is used; otherwise it is
not. In addition, an incorrectly calculated Expires header field is added
which is set to negative_revalidating_lifetime seconds from the current
time rather than negative_revalidating_lifetime seconds from the resources
calculated generation time.

These negative_revalidating_lifetime issues may very well be day zero bugs.
Looking at git, it goes back to at least 2009 in a commit
<https://github.com/apache/trafficserver/commit/a1651345d11f228dfb392cc79169760e80d25f06>
titled "Initial commit" (presumably when we transitioned source code
repositories). It has not worked for at least a very long time and people
have been relying, knowingly or not, upon max_stale_age to limit how stale
resources can be before they are used for negative revalidation.

This design review evaluates the options for how to handle
negative_revalidating_lifetime.

Options
Option 1: Document the Current Behavior
The first option is to leave the feature as is and document how
negative_revalidating_lifetime currently behaves.

Advantages:

  - This requires the least amount of work as it only involves a
  documentation change.
  - This will effectively avoid user confusion about
  negative_revalidating_lifetime so that no further issues will be filed and
  time wasted when it does not behave as expected.

Disadvantages

  - negative_revalidating_lifetime, as it currently behaves, would be hard
  to explain without embarrassing the project. We would have to say something
  like:

The value of negative_revalidating_lifetime is used to calculate an Expires
> header with a field value of negative_revalidating_lifetime seconds in the
> future from the time the cached resource is served. This header is added in
> the cached response. No other behavior is influenced by this configuration.
> The age limit enforced for such stale resources is configured
> via proxy.config.http.cache.max_stale_age.


This likely raises more questions than it answers. In reality, it would
probably be most helpful to steer users away from this feature and toward
max_stale_age.


Option 2: Fix the Issues
Alternatively, we can fix the issues surrounding
negative_revalidating_lifetime. Looking at the code, it seems that the
Expires header, which is currently incorrectly calculated, really isn't
needed. Rather we can make the negative_revalidating_lifetime conditionally
a part of the HttpTransact::is_stale_cache_response_returnable()
<https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L5943>
function.

In addition to max_stale_age, which is currently taken into account there,
we can also compare against negative_revalidating_lifetime to see whether
the resource can be served (but only when negative revalidation is being
used...the function is used in other contexts).

Advantages:

  - This addresses both the known issues with
  negative_revalidating_lifetime in that the Expires header is no longer
  added and the negative_revalidating_lifetime is effectively used in
  determining whether the item is fresh enough for the response.
  - The fix doesn't look all that hard to implement.

Disadvantages:

  - Implementing the feature adds various complexities to ATS. It
  complicates the documentation, it complicates the code, and will require
  tests to verify it works correctly. None of this complexity is large if the
  feature is worth it, and I'm happy to do it. But I have my doubts that it
  is worth it. Keep in mind that negative_revalidating_lifetime has never
  been a working feature in ATS and this has caused problems for no one. If
  its absence had caused problems, we would know about it because people
  would complain. In fact, the only reported issue with the feature is the
  confusion it caused by existing in the document in the first place. Also,
  keep in mind that this feature would only be useful, if it is at all, to
  *extend* the lifetime of these cached resources past max_stale_age. Both
  of those values would be used and we would want to use the max of each.
  Currently max_stale_age defaults to 7 days. How many users will want
  max_stale_age to be one value for general max-age computations, but use a
  different, larger, value for negative revalidation? I'm guessing not many.
  (Alternatively, I suppose we could use only negative_revalidating_lifetime
  instead of using both it and max_stale_age in these situations, but that
  would potentially be a breaking change for current configurations in which
  negative_revalidation would now only happen for the default 30 minutes of
  negative_revalidating_lifetime rather than the default 7 days of
  max_stale_age.)


Option 3: Remove negative_revalidating_lifetime
We can remove negative_revalidating_lifetime from the product. This would
involve:

  - Removing the documentation of negative_revalidating_lifetime
  <https://docs.trafficserver.apache.org/en/latest/admin-guide/files/records.config.en.html#proxy-config-http-negative-revalidating-lifetime>
  from our records.config documentation. We would also want to explain the
  role that max_stale_age plays in limiting the usage of stale resources in
  negative revalidating circumstances.
  - Removing the addition of the faulty Expires header from HttpTransact.cc
  <https://github.com/apache/trafficserver/blob/f639621e238f778ba6f18aaa5de47772d6504e55/proxy/http/HttpTransact.cc#L4364>
  - Keep the parsing of negative_revalidating_lifetime, if it is used by a
  user, but add a Warning that it is defunct, deprecated, and will be removed
  in a future release.

Advantages:

  - This removes the confusion associated with having both a
  negative_revalidating_lifetime feature used for these negative revalidating
  resources and the max_stale_age feature.
  - It removes the bugs associated with negative_revalidating_lifetime
  because the feature is removed.

Disadvantages:

  - If someone would like a negative_revalidating_lifetime feature in the
  future, it will not exist and will need to be implemented.

Suggestion
I recommend Option 3: Remove negative_revalidating_lifetime. It's safe to
remove because the feature does not currently work and seemingly has never
worked; thus nothing is being taken away from ATS. On the other hand, the
buggy behavior and confusing documentation is alleviated by it being
removed. The downside to this option is that we won't have the added
negative_revalidating_lifetime feature in the event that someone wants it
in the future. But, again, that leaves us no worse off than we are today
since our current implementation doesn't effectively implement it.

Let me know if you have other thoughts or concerns.

Thanks,
Brian

-- 
"Come to Me, all who are weary and heavy-laden, and I will
give you rest. Take My yoke upon you and learn from Me, for
I am gentle and humble in heart, and you will find rest for
your souls. For My yoke is easy and My burden is light."

    ~ Matthew 11:28-30