Posted to dev@trafficserver.apache.org by Fei Deng <du...@apache.org> on 2020/12/17 20:32:20 UTC

Re: ATS Memory Usage Steady Increase

Not saying this is the exact cause, but we've seen similar behavior
previously. In our case the session cache size was set too large relative
to the available RAM; because sessions are only removed from the cache
once it is full, inserting new sessions kept triggering
*removeOldestSession*. You might want to check the configuration for this
feature, *proxy.config.ssl.session_cache.size*.
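
If it helps, one quick way to see what the running box is actually using
for that feature is traffic_ctl (a sketch; the record names below are the
documented 8.x ones, so verify them against your version):

    # Check the ATS-side TLS session cache settings on a live instance.
    # Requires traffic_manager to be running for traffic_ctl to connect.
    traffic_ctl config get proxy.config.ssl.session_cache
    traffic_ctl config get proxy.config.ssl.session_cache.size

    # Or list every record under the feature in one shot:
    traffic_ctl config match proxy.config.ssl.session_cache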

On Thu, Dec 17, 2020 at 1:52 PM Hongfei Zhang <ho...@gmail.com> wrote:

> Hi Folks,
>
> Based on the information provided at
> https://docs.trafficserver.apache.org/en/8.1.x/admin-guide/performance/index.en.html#memory-allocation
> and with a fixed ram_cache.size setting (32GB), we expected the memory
> usage to plateau after a couple of days of usage. This is, however, not
> what we saw in multiple production environments. The memory usage
> increases steadily over time, albeit at a slow pace once the system’s
> memory usage reaches 80-85% (there aren’t many other processes running
> on the system), until the ATS process is killed by the kernel (OOM kill)
> or by human intervention (server restart). On a system with 192GB of RAM
> (32GB used for a RAM disk, and ATS configured to use up to 32GB of ram
> cache), with streaming throughput peaking at 10Gbps, ATS has to be
> killed/restarted about every 2 weeks. At peak hours, there are about
> 5k-6k client connections and fewer than 1k upstream connections (to
> mid-tier caches).
>
> We did some analysis on the freelist dump (kill -USR1 <pid>) output (an
> example is attached) and found that the allocations in the
> ioBufAllocator[0-14] slots appeared to be the main contributor to the
> total, and also the likely source of the increase over time.
>
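
As a point of reference, the dump described above can be triggered and
filtered roughly as follows. This is a sketch: the dump goes wherever
traffic_server's stderr is captured (typically traffic.out), so adjust
the path for your layout.

    # Trigger a fresh freelist dump and pull out the ioBufAllocator rows.
    kill -USR1 "$(pidof traffic_server)"
    sleep 2   # give the dump a moment to be flushed
    # 15 slots (ioBufAllocator[0-14]), so the last 15 rows are the newest dump.
    grep ioBufAllocator /var/log/trafficserver/traffic.out | tail -15
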
> In terms of configuration and plugin usage, in addition to setting the
> ram_cache to 32GB, we also changed
> proxy.config.http.default_buffer_water_mark INT 15000000 (from the
> default 64k) to allow an entire video segment to be buffered on the
> upstream connection, avoiding client starvation when the first client
> comes over a slow-draining link, and
> proxy.config.cache.target_fragment_size INT 4096 to allow upstream
> chunked responses to be written into cache storage promptly. There are
> no connection limits (the number of connections always appeared to be
> in the normal range). The inactivity timeout values are fairly low
> (<120 secs). The only plugin we use is header_rewrite.so. No HTTPS, no
> HTTP/2.
>
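
As a side note, those two overrides can be read back from the running
process with traffic_ctl to confirm they took effect (a sketch;
traffic_ctl needs a running traffic_manager to talk to):

    # Confirm the two overrides described above are live.
    traffic_ctl config get proxy.config.http.default_buffer_water_mark
    traffic_ctl config get proxy.config.cache.target_fragment_size
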
> I would appreciate it if someone could shed some light on how to further
> track this down, along with any practical tips for short-term mitigation.
> In particular:
> 1. Inside HttpSM, which states allocate/re-use IOBuffers? Is there a way
> to put a ceiling on each slot, or on the total allocation?
> 2. Is the ioBufAllocator ceiling a function of total connections, in
> which case should I set a connection limit?
> 3. memory/RamCacheLRUEntry shows 5.2M; how is this related to the actual
> ram_cache usage reported by traffic_top (32GB used)?
> 4. At the point of the freelist dump, the ATS process size was 78GB and
> the freelist total showed about 44GB, with 32GB of ram_cache used (per
> traffic_top). Assuming these two numbers do not overlap, and knowing that
> the in-memory (disk) directory entry cache takes at least 10GB, the
> numbers don't add up: 44+32+10 >> 78. What am I missing?
>
>
> Thanks,
> -Hongfei
>

Re: ATS Memory Usage Steady Increase

Posted by Masaori Koshiba <ma...@apache.org>.
Hi Hongfei,

Recently, we hit a memory leak when following redirects with 8.1.1 [*1].
The fix [*2] is coming in the next releases, 8.1.2 and 9.0.1.
If you didn't see the leak with 8.1.0, it might be the same one.

> 1. Inside HttpSM, which states allocate/re-use IOBuffers? Is there a way
> to put a ceiling on each slot, or on the total allocation?
IOBuffers are allocated in many states. In general, AddressSanitizer
(ASan) is helpful for detecting memory leaks:

1. Build ATS with the `--enable-asan` option
2. Run Traffic Server with the freelists disabled (--fF); see the sketch below
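
A rough sketch of those two steps, assuming an autotools build of the
8.1.x tree; the freelist flag spelling and any extra configure options
should be double-checked against --help on your build:

    # Step 1: build with AddressSanitizer instrumentation.
    ./configure --enable-asan        # add your usual configure options here
    make -j "$(nproc)" && sudo make install

    # Step 2: run traffic_server directly so ASan/LSan can report on exit.
    # "-fF" is the freelist-disabling option referred to as "--fF" above;
    # confirm the exact spelling with `traffic_server --help`.
    ASAN_OPTIONS=detect_leaks=1 traffic_server -fF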

[*1] https://github.com/apache/trafficserver/issues/7380
[*2] https://github.com/apache/trafficserver/pull/7401

Thanks,
Masaori


Re: ATS Memory Usage Steady Increase

Posted by Hongfei Zhang <ho...@gmail.com>.
Thanks Fei.  There weren't any SSL/TLS sessions in our environment, but I do feel some memory is being held by ‘dormant’ sessions. The total amount of memory held by the freelists (44GB) was, however, surprisingly high, and the majority of it (99%) is allocated through and held by ioBufAllocator. I am wondering if there is any way to limit the size of these freelists. I am also curious what causes the ‘Allocated’ count to continue to go up, and why ‘In-Use’ did not drop to zero after user traffic stopped (and all of the keep-alive sessions timed out).

I am also puzzled that the memory/RamCacheLRUEntry line shows only 5.2M, whereas traffic_top shows about 32GB of ram cache used.


Thanks,
-Hongfei
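
One low-tech way to narrow down which ioBufAllocator slot keeps growing is
to snapshot the freelist dump periodically and compare the 'Allocated'
column over time. A sketch, assuming the dump is appended to traffic.out
under /var/log/trafficserver:

    # Record the ioBufAllocator rows every 10 minutes for later diffing.
    while true; do
      kill -USR1 "$(pidof traffic_server)"
      sleep 5   # let the dump be written
      { date; grep ioBufAllocator /var/log/trafficserver/traffic.out | tail -15; } \
        >> /tmp/freelist-trend.log
      sleep 600
    done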
