Posted to dev@trafficserver.apache.org by Acácio Centeno <ac...@azion.com> on 2014/03/20 19:28:45 UTC

How regex purge works?

Hello,

We're considering using ATS as a cache solution, but one of the features we
must have -- regex purge -- seems to be too slow for our purposes.

We read on [1] that the cache's in-RAM index does not store the URL:

"Data in HTTP headers cannot be examined without disk I/O. This includes
the original URL for the object. The cache key is not stored explicitly and
therefore cannot be reliably retrieved."

So when ATS receives a regex purge request, it must scan the disk for each
and every object in order to retrieve its URL and compare it to the
pattern being searched. Is this assumption correct?
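
To make the assumption concrete, here is a minimal sketch of an index that
is keyed only by a one-way hash of the URL. It is not ATS code: the class
ToyCache, its methods, the dict standing in for the disk, and the choice of
SHA-256 are all hypothetical and only illustrate why an exact purge can skip
the disk while a regex purge cannot.

```python
import hashlib
import re


class ToyCache:
    """Toy model: the in-RAM index holds only a hash of the URL, not the URL."""

    def __init__(self):
        self.index = set()   # in-RAM: cache keys (URL hashes) only
        self.disk = {}       # stands in for on-disk objects: key -> (url, body)
        self.disk_reads = 0  # counts simulated disk reads

    @staticmethod
    def key_for(url):
        # One-way hash (assumed): the URL cannot be recovered from the key.
        return hashlib.sha256(url.encode("utf-8")).hexdigest()

    def put(self, url, body):
        key = self.key_for(url)
        self.index.add(key)
        self.disk[key] = (url, body)

    def purge(self, url):
        # Exact purge: hash the URL and drop the entry; no read is needed.
        key = self.key_for(url)
        self.index.discard(key)
        self.disk.pop(key, None)

    def regex_purge(self, pattern):
        # Regex purge: every object must be read to learn its URL.
        regex = re.compile(pattern)
        removed = 0
        for key in list(self.index):
            self.disk_reads += 1
            url, _body = self.disk[key]
            if regex.search(url):
                self.index.discard(key)
                del self.disk[key]
                removed += 1
        return removed
```

In this model the cost of regex_purge grows with the number of cached
objects, regardless of how many of them match the pattern.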

We're also assuming that this fact, that it must read from disk, is the
reason why regex purge is slow. We briefly looked at the source code, and it
seems that ATS schedules a series of scans when it receives such a request,
and these scans are scheduled in a conservative way, probably to avoid
slowing down the processing of other requests.

We searched the bug reports to see if there were reports concerning this.
We found issue TS-1323 [2] that seems to be about it, but we're not sure as
the issue's text was brief (it does not mention regex purge, but does
mention the fact that the object must be accessed).

Are our assumptions correct? If so, is there a plan to change the way it's
done (maybe have another index with the URLs, so the lookup would not have
to read from disk?)

[1]
https://trafficserver.readthedocs.org/en/latest/arch/cache/cache-arch.en.html
[2] https://issues.apache.org/jira/browse/TS-1323

Thanks in advance,
-- 
Acácio Centeno

Porto Alegre, Brasil + 55 51 3012 3005
Miami, USA + 1 305 704 8816

Any information in this e-mail and attachments may be confidential and
privileged, protected by legal confidentiality. The use of this document
requires authorization by the issuer, subject to penalties.

Re: How regex purge works?

Posted by Phil Sorber <so...@apache.org>.

I have such a plugin that I just got authorized to commit. I will do that
today. Basically it's a regex ban list.


On Thu, Mar 20, 2014 at 2:18 PM, Leif Hedstrom <zw...@apache.org> wrote:

>
> On Mar 20, 2014, at 1:11 PM, Tomasz Kuzemko <to...@kuzemko.net> wrote:
>
> > An alternative solution would be to implement an additional layer
> > similar to how Varnish provides "ban lists". The name is a little
> > misleading and refers to banning content from being served from the cache.
> > In a nutshell, Varnish keeps a list of bans which can be regexes. Each
> > cache hit is then checked against all _newer_ bans and reconsidered as a
> > miss in case of match.
>
>
> Agreed. And such a layer can (and probably should) be implemented as a
> plugin. Or alternatively, you can have a plugin that keeps metadata about
> what is being written to the cache, and "regex" purge based on that. Either
> way, it allows us to keep the core cache lean and mean.
>
> -- Leif
>
>

Re: How regex purge works?

Posted by Leif Hedstrom <zw...@apache.org>.

On Mar 20, 2014, at 1:11 PM, Tomasz Kuzemko <to...@kuzemko.net> wrote:

> An alternative solution would be to implement an additional layer similar to how Varnish provides "ban lists". The name is a little misleading and refers to banning content from being served from the cache. In a nutshell, Varnish keeps a list of bans which can be regexes. Each cache hit is then checked against all _newer_ bans and reconsidered as a miss in case of match.


Agreed. And such a layer can (and probably should) be implemented as a plugin. Or alternatively, you can have a plugin that keeps metadata about what is being written to the cache, and “regex” purge based on that. Either way, it allows us to keep the core cache lean and mean.
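
A rough sketch of the second idea, a side index that records URLs as they are written and turns a regex purge into a series of exact purges. This is plain Python rather than a real ATS plugin: UrlSideIndex, on_cache_write and the purge_one callback are hypothetical names, and how a plugin would actually observe cache writes or issue purges is not shown.

```python
import re


class UrlSideIndex:
    """Hypothetical side index: remembers URLs written to the cache."""

    def __init__(self, purge_one):
        self._urls = set()
        # Assumed callback that performs an ordinary exact-URL purge.
        self._purge_one = purge_one

    def on_cache_write(self, url):
        # Called (by assumption) whenever an object is stored in the cache.
        self._urls.add(url)

    def regex_purge(self, pattern):
        # Match against the in-memory URL set, then purge each hit exactly.
        regex = re.compile(pattern)
        matched = sorted(u for u in self._urls if regex.search(u))
        for url in matched:
            self._purge_one(url)
            self._urls.discard(url)
        return matched


if __name__ == "__main__":
    purged = []
    index = UrlSideIndex(purged.append)
    index.on_cache_write("http://example.com/a.css")
    index.on_cache_write("http://example.com/b.js")
    print(index.regex_purge(r"\.css$"))
```

The trade-off in this sketch is that the set of URLs lives outside the core cache, so its memory (or storage) cost is paid only by deployments that enable it.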

— Leif

Re: How regex purge works?

Posted by Tomasz Kuzemko <to...@kuzemko.net>.

An alternative solution would be to implement an additional layer 
similar to how Varnish provides "ban lists". The name is a little 
misleading and refers to banning content from being served from the 
cache. In a nutshell, Varnish keeps a list of bans which can be regexes. 
Each cache hit is then checked against all _newer_ bans and reconsidered 
as a miss in case of match.

Additionally, a ban-lurker thread scans the cache in the background,
looking for objects that match bans and evicting them. This helps keep
the ban list at a reasonable size.
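
The scheme described above can be sketched roughly as follows. This is a
toy model, not Varnish or ATS code: BanListCache and its methods are
hypothetical names, a logical counter stands in for timestamps, and the
"lurker" is a plain method call rather than a background thread.

```python
import itertools
import re


class BanListCache:
    """Toy cache with a regex ban list checked on every hit."""

    def __init__(self):
        self._clock = itertools.count(1)  # logical clock orders puts and bans
        self._objects = {}                # url -> (stored_at, body)
        self._bans = []                   # (added_at, compiled regex)

    def put(self, url, body):
        self._objects[url] = (next(self._clock), body)

    def ban(self, pattern):
        # Adding a ban is cheap: nothing in the cache is touched yet.
        self._bans.append((next(self._clock), re.compile(pattern)))

    def _is_banned(self, url, stored_at):
        # Only bans newer than the object apply to it.
        return any(added_at > stored_at and regex.search(url)
                   for added_at, regex in self._bans)

    def get(self, url):
        entry = self._objects.get(url)
        if entry is None:
            return None
        stored_at, body = entry
        if self._is_banned(url, stored_at):
            del self._objects[url]
            return None  # a banned hit is reconsidered as a miss
        return body

    def lurk(self):
        """One full lurker pass: evict banned objects, then drop the bans."""
        evicted = 0
        for url, (stored_at, _body) in list(self._objects.items()):
            if self._is_banned(url, stored_at):
                del self._objects[url]
                evicted += 1
        # Every surviving object has been tested against every ban, and
        # objects stored later are newer than all of them.
        self._bans.clear()
        return evicted
```

Because a hit already has the object's URL at hand, the check costs one
regex match per newer ban and requires no extra scan; the lurker pass is
what keeps that per-hit cost from growing without bound.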

You can read more about it in Varnish tutorial: 
https://www.varnish-cache.org/docs/3.0/tutorial/purging.html#bans

This method has its drawbacks, particularly if there are a lot of
long-lived objects in the cache. Still, it gives more flexibility to admins.

Best regards,
Tomasz Kuzemko
tomasz@kuzemko.net

On 20.03.2014 20:48, Alan M. Carroll wrote:
> Thursday, March 20, 2014, 1:28:45 PM, you wrote:
>
>> So when ATS receives a regex purge request, it must scan the disk for each
>> and every object in order to retrieve its URL and compare it to the
>> pattern being searched. Is this assumption correct?
>
> Yes.
>
>> Are our assumptions correct? If so, is there a plan to change the way it's
>> done (maybe have another index with the URLs, so the lookup would not have
>> to read from disk?)
>
> We are working on a general upgrade to the cache API to enable plugins to do more, but that's not going to be done in the near future, and even then it's very unlikely we will change this aspect. The problem is that an in-memory URL index would increase memory requirements massively, and we already have people who struggle to provide sufficient RAM for their (multi-hundred terabyte) caches.
>

Re: How regex purge works?

Posted by "Alan M. Carroll" <am...@network-geographics.com>.

Thursday, March 20, 2014, 1:28:45 PM, you wrote:

> So when ATS receives a regex purge request, it must scan the disk for each
> and every object in order to retrieve its URL and compare it to the
> pattern being searched. Is this assumption correct?

Yes.

> Are our assumptions correct? If so, is there a plan to change the way it's
> done (maybe have another index with the URLs, so the lookup would not have
> to read from disk?)

We are working on a general upgrade to the cache API to enable plugins to do more, but that's not going to be done in the near future, and even then it's very unlikely we will change this aspect. The problem is that an in-memory URL index would increase memory requirements massively, and we already have people who struggle to provide sufficient RAM for their (multi-hundred terabyte) caches.
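
A back-of-the-envelope sketch of that memory cost. Only the "multi-hundred terabyte" cache size comes from this thread; the 200 TB figure, the 100 kB average object size and the 100-byte average URL length are assumed values chosen for illustration, and url_index_bytes is a hypothetical helper.

```python
TB = 10**12
GB = 10**9


def url_index_bytes(cache_bytes, avg_object_bytes, avg_url_bytes):
    """Extra RAM needed to hold one URL per cached object (URLs only)."""
    objects = cache_bytes // avg_object_bytes
    return objects * avg_url_bytes


if __name__ == "__main__":
    # Assumed: 200 TB cache, 100 kB average object, 100-byte average URL.
    extra = url_index_bytes(200 * TB, 100 * 10**3, 100)
    print(extra / GB)  # prints 200.0
```

Under these assumptions the URLs alone would need about 200 GB of RAM, before any per-entry overhead of the index structure, and smaller average objects would push the figure higher.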