Posted to issues@trafficserver.apache.org by "Leif Hedstrom (JIRA)" <ji...@apache.org> on 2016/02/04 19:55:46 UTC
[jira] [Closed] (TS-3549) Configurable option to avoid thundering herd due to concurrent requests for the same object

     [ https://issues.apache.org/jira/browse/TS-3549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leif Hedstrom closed TS-3549.
-----------------------------

> Configurable option to avoid thundering herd due to concurrent requests for the same object
> -------------------------------------------------------------------------------------------
>
>                 Key: TS-3549
>                 URL: https://issues.apache.org/jira/browse/TS-3549
>             Project: Traffic Server
>          Issue Type: New Feature
>          Components: HTTP
>    Affects Versions: 5.3.0
>            Reporter: Sudheer Vinukonda
>            Assignee: Sudheer Vinukonda
>             Fix For: 6.0.0
>
>         Attachments: TS-3549.diff
>
>
> When ATS is used as a delivery server for a live video streaming event, there can be a huge number of concurrent requests for the same object. Depending on the type of object requested, the cache lookup can result in either a stale copy of the object (e.g. manifest files) or a complete cache miss (e.g. segment files). ATS currently supports different types of connection collapse (e.g. *read-while-write* functionality - *https://docs.trafficserver.apache.org/en/latest/admin/http-proxy-caching.en.html#read-while-writer*, swr etc.), but in order for *rww* to kick in, ATS requires that the complete response headers for the object be received and validated. In other words, until this happens, any number of incoming requests for the same object that result in a cache miss or a stale cache hit are forwarded to the origin. For a scenario such as a live event, this leaves a significant window during which hundreds of requests for the same object could be forwarded to the origin. It has been observed in production that this results in a significant increase in latency for the objects waiting in read-while-write state.
> Note that there are also a couple of settings, *proxy.config.http.cache.open_read_retry_time* and *proxy.config.http.cache.max_open_read_retries* (*https://docs.trafficserver.apache.org/en/latest/admin/http-proxy-caching.en.html#open-read-retry-timeout*), that can alleviate the thundering herd to some extent by retrying to get the read lock for the object as configured. With these configured, ATS retries getting the read lock for as long as configured; if it's still not available because the write lock is held by the first request that was forwarded to the origin (e.g. the response headers have not been received yet), then all the waiting requests are simply forwarded to the origin (by disabling cache for each of them).
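As a rough illustration of why these two settings only narrow the window, the sketch below models the retry budget as retries multiplied by retry time. It is a simplified model under stated assumptions, not ATS code: the function names are hypothetical, the numbers are illustrative, and treating a non-positive retry count as "no waiting" is an assumption.

```python
def read_retry_window_ms(max_open_read_retries: int, open_read_retry_time_ms: int) -> int:
    """Upper bound (ms) on how long a waiting request keeps retrying the read lock.

    Assumption: a non-positive retry count or retry time means no waiting at all.
    """
    return max(max_open_read_retries, 0) * max(open_read_retry_time_ms, 0)


def forwarded_to_origin(header_latency_ms: int,
                        max_open_read_retries: int,
                        open_read_retry_time_ms: int) -> bool:
    """True if the waiting request gives up before the first request's
    response headers arrive, and is therefore sent to the origin itself."""
    window = read_retry_window_ms(max_open_read_retries, open_read_retry_time_ms)
    return header_latency_ms > window


if __name__ == "__main__":
    # Illustrative values only: 5 retries of 10 ms each give a 50 ms budget.
    print(read_retry_window_ms(5, 10))
    # An origin that needs 200 ms to return headers exceeds that budget.
    print(forwarded_to_origin(200, 5, 10))
```

Under this model, any origin whose header latency exceeds the configured budget still sees the full herd, which is why a single pair of values rarely fits every traffic pattern.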
> It is almost impossible to tune the above settings so that they help in all possible situations (traffic, concurrent connections, network conditions etc.). For this reason, a configurable workaround is proposed below that avoids the thundering herd completely. The patch below is mainly from [~jlaue] and [~psudaemon], with some additional cleanup, configuration control, debug headers etc.
> Basically, when configured, on failing to obtain a write lock for an object (which means there's another ongoing parallel request for the same object that was forwarded to the origin), if it's a cache refresh miss, a stale copy of the object is served, while if it's a complete cache miss, a *502* error is returned to let the client (e.g. player) retry. The *502* error also includes a special internal ATS header named {{@ats-internal-messages}} with the appropriate value to allow for custom logging or for plugins to take any appropriate action (e.g. prevent a fail-over if there's a plugin that fails over on a regular 502 error).
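
The decision described above can be sketched as a small function. This is an illustration of the described behaviour only, not the code in TS-3549.diff: the type and function names are hypothetical, and the header value is a placeholder because the issue does not state the actual value.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Lookup(Enum):
    MISS = "miss"    # complete cache miss (e.g. segment files)
    STALE = "stale"  # cache refresh miss: a stale copy exists (e.g. manifest files)


@dataclass(frozen=True)
class Action:
    kind: str                      # "forward_to_origin", "serve_stale" or "error"
    status: Optional[int] = None   # set only when ATS generates the response itself
    headers: Dict[str, str] = field(default_factory=dict)


# Placeholder: the issue only says "the appropriate value".
PLACEHOLDER_MESSAGE = "<value not given in the issue>"


def on_write_lock_failure(lookup: Lookup, option_enabled: bool) -> Action:
    """Choose what to do when the write lock is held by another request."""
    if not option_enabled:
        # Behaviour without the option: the request goes to the origin.
        return Action("forward_to_origin")
    if lookup is Lookup.STALE:
        # Cache refresh miss: serve the stale copy instead of hitting the origin.
        return Action("serve_stale")
    # Complete cache miss: return a 502 so the client (e.g. player) retries.
    return Action("error", 502, {"@ats-internal-messages": PLACEHOLDER_MESSAGE})


if __name__ == "__main__":
    print(on_write_lock_failure(Lookup.MISS, option_enabled=True))
```

A plugin that fails over on 502 could check for the presence of the internal header before acting, which is the use case the description mentions.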



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)