Posted to bugs@httpd.apache.org by bu...@apache.org on 2018/12/13 21:18:29 UTC

[Bug 63010] New: mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory

https://bz.apache.org/bugzilla/show_bug.cgi?id=63010

            Bug ID: 63010
           Summary: mod_proxy_hcheck when health checks configured, esp.
                    vs down service, consumes large memory
           Product: Apache httpd-2
           Version: 2.4.37
          Hardware: PC
            Status: NEW
          Severity: major
          Priority: P2
         Component: mod_proxy_hcheck
          Assignee: bugs@httpd.apache.org
          Reporter: carltonf@gmail.com
  Target Milestone: ---

Defect related to mod_proxy_hcheck

Found/reproduced in Apache httpd versions 2.4.33, 2.4.35, and 2.4.37
(mod_proxy_hcheck module) on Windows.

We added the hcheck module and then noticed that httpd started consuming large
amounts of memory. In previous versions, before configuring the module, httpd
was stable at about 35MB of RAM. With this module, the memory would grow
to 3GB and stabilize. In one repro it actually grew to 17GB, and the machine
ran out of pagefile space and crashed. We noticed that in the default
configuration we were checking a service that was not running. After
configuring this service to run, the memory used dropped significantly:
instead of 3GB, it grew to 150MB and stabilized.

Even with the service running, there is a period shortly after startup
during which the health check runs hundreds of times per second for each
service. During this time the memory consumption steadily grows. It does this
for about 5 seconds, generating thousands of lines of trace-level logging and
running the health check over a thousand times for each service. Then for some
reason it stabilizes, and checks are seen at the more normal 30-second interval.
But the memory is not released and the process still holds an elevated amount of
memory. This might just be the health check module requiring more memory
than running without it, but it is concerning that the increased memory
consumption seems related to checking a service which is down. If a service is
down, you have one problem; if the health check itself starts consuming large
amounts of resources on top of that, the problem is compounded.


278 times in one second:
Condition ok1234 for 6af324730 (http://localhost:8843): passed

This sequence repeated (this is logging from a single thread checking one
endpoint); please see the attached file for a full one-second interval of
logging, 15k+ lines:
[Thu Dec 13 17:58:40.618694 2018] [proxy_hcheck:debug] [pid 21288:tid 748]
mod_proxy_hcheck.c(829): AH03256: Threaded Health checking
http://localhost:8843
[Thu Dec 13 17:58:40.618694 2018] [proxy:debug] [pid 21288:tid 748]
proxy_util.c(2313): AH00942: HCOH: has acquired connection for (localhost)
[Thu Dec 13 17:58:40.619694 2018] [proxy:trace2] [pid 21288:tid 748]
proxy_util.c(3010): HCOH: fam 2 socket created to connect to localhost
[Thu Dec 13 17:58:40.619694 2018] [proxy:debug] [pid 21288:tid 748]
proxy_util.c(3042): AH02824: HCOH: connection established with 127.0.0.1:8843
(localhost)
[Thu Dec 13 17:58:40.620696 2018] [proxy:debug] [pid 21288:tid 748]
proxy_util.c(3212): AH00962: HCOH: connection complete to 127.0.0.1:8843
(localhost)
[Thu Dec 13 17:58:40.649654 2018] [proxy_hcheck:debug] [pid 21288:tid 748]
mod_proxy_hcheck.c(644): AH03254: HTTP/1.1 200 OK
[Thu Dec 13 17:58:40.649654 2018] [proxy_hcheck:trace2] [pid 21288:tid 748]
mod_proxy_hcheck.c(797): Condition ok1234 for 6af324730
(http://localhost:8843): passed
[Thu Dec 13 17:58:40.649654 2018] [proxy:debug] [pid 21288:tid 748]
proxy_util.c(2328): AH00943: HCOH: has released connection for (localhost)
[Thu Dec 13 17:58:40.649654 2018] [proxy_hcheck:debug] [pid 21288:tid 748]
mod_proxy_hcheck.c(573): AH03251: Health check GET Status (0) for 6af324730.


Excerpts from httpd.conf:

...
# This defines a template for use by proxy_hcheck_module.
# Some other specific parameters are set directly on the BalancerMember lines.
ProxyHCExpr ok1234 {%{REQUEST_STATUS} =~ /^[1234]/}
ProxyHCTemplate hcsimple hcmethod=GET hcexpr=ok1234 hcuri=/favicon.ico
...
<Proxy balancer://samlservice-cluster>
BalancerMember http://localhost:8843 route=1 hcinterval=30 hcfails=2 hcpasses=1 hctemplate=hcsimple
</Proxy>
...
LogLevel debug
LogLevel proxy_hcheck:TRACE2
LogLevel proxy:TRACE2
LogLevel watchdog:TRACE2
...

Related: bug 62318, which has a fix in place:
https://bz.apache.org/bugzilla/show_bug.cgi?id=62318

This issue has similarities - note comments about consuming 20GB heap space:
https://bugs.httpd.apache.narkive.com/tyCY7SPh/bug-61499-new-tcp-healthchecks-failing-falsely-not-actually-checking

Also noticed this issue on Reddit
https://www.reddit.com/r/apache/comments/9rs4cd/apache_24_memory_leak_when_using_mod_proxy/



[Bug 63010] mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=63010

--- Comment #3 from pawel.c <pa...@ingbank.pl> ---
Also reproducible on 2.4.39, Windows Server 2012.

The health check memory leak rate depends on the check frequency.

To reproduce:
(1) You need more than one high-frequency health check, for example:
ProxyHCExpr wsOK {%{REQUEST_STATUS} =~ /405|200/}
ProxyHCTemplate X1 hcexpr=wsOK hcmethod=GET hcuri=WHATEVER_WORKS_FOR_YOU_1 hcinterval=5 hcpasses=2 hcfails=1
<Proxy balancer://clusterFxAgregation>
BalancerMember http://127.0.0.1:10179 route=127.0.0.1 retry=0 hctemplate=X1
BalancerMember http://10.115.51.15:10179 status=+H route=10.115.51.15 retry=0 hctemplate=X1
BalancerMember http://10.115.51.27:10179 status=+H route=10.115.51.27 retry=0 hctemplate=X1
</Proxy>

ProxyHCTemplate X2 hcexpr=wsOK hcmethod=GET hcuri=WHATEVER_WORKS_FOR_YOU_2 hcinterval=5 hcpasses=2 hcfails=1
<Proxy balancer://clusterFxAgregation>
BalancerMember http://127.0.0.1:10178 route=127.0.0.1 retry=0 hctemplate=X2
BalancerMember http://10.115.51.15:10178 status=+H route=10.115.51.15 retry=0 hctemplate=X2
BalancerMember http://10.115.51.27:10178 status=+H route=10.115.51.27 retry=0 hctemplate=X2
</Proxy>

Here these are defined as two virtual hosts in the same conf file.

(2) Some nodes must be down, for instance 127.0.0.1 in the example above.

(3) Observe memory. The leak speed depends on the number of defined checks and
their frequency. With a configuration like the one above and around 20 virtual
hosts it should be roughly 10MB/min.



[Bug 63010] mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=63010

Christophe JAILLET <ch...@wanadoo.fr> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |61499


Referenced Bugs:

https://bz.apache.org/bugzilla/show_bug.cgi?id=61499
[Bug 61499] TCP healthchecks failing falsely / not actually checking


[Bug 63010] mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=63010

WJCarpenter <bi...@carpenter.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 OS|                            |All
                 CC|                            |bill-apache@carpenter.org



[Bug 63010] mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=63010

--- Comment #9 from Yann Ylavic <yl...@gmail.com> ---
Hi Cristian,

(In reply to Cristian Duca from comment #8)
> 
> I work with Tudor.
> We already implemented that fix and it works as expected -> apache is stable
> and we are not encountering any memory leaks issues.

I'm not sure I understand: did you already test the fix/patch from comment 7?

> We have configured our hcinterval at 10 seconds (due to a LoadBalancer Ping
> Test), hcpasses=2 and increased the ProxyHCTPSize to 256 threads.
> We wanted to know if/when this issue will be fixed coding-wise and not
> mitigated through configuration. 

If the fix you tested is this mitigation (by configuration), it would help
speed up the backport to 2.4 if you could test the patch (in the code) proposed
in comment 7.
If it works with no particular mitigation in the configuration, then we can
consider that it addresses the issue and do the backport to 2.4 (next release).



[Bug 63010] mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=63010

--- Comment #11 from Cristian Duca <du...@yahoo.com> ---
Hello, Yann,

I have successfully compiled Apache on Windows.
I have implemented the changes from comment 7 for the mod_proxy_hcheck module.

I can confirm that the fix from comment 7 works without any workarounds.

Can we expect this fix to be included in the next Apache release build?

Thank you,
Cristian Duca.



[Bug 63010] mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=63010

--- Comment #6 from Tudor Tudorache <tu...@cerner.com> ---
We are running into the same issue with our current configuration. Apache
develops a serious memory leak, exceeding 1.2 GB, if one of the JVMs is
started or stopped. We were previously on Apache 2.4.26 with the exact same
configuration and never ran into this.
We are still waiting for a proper fix; is there any estimate on that?
Thank you,
Tudor

I can send out our httpd configuration showing which balancer health checks we run.



[Bug 63010] mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=63010

--- Comment #7 from Yann Ylavic <yl...@gmail.com> ---
Created attachment 37621
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=37621&action=edit
Avoid concurrent checks

This patch resets the timestamp during the check (and prevents re-entering the
check if the timestamp is zero), and then updates the timestamp after the check
to schedule the next one.

That's fix 1 & 2 from comment 4.
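
In outline, the idea is roughly the following. This is a minimal sketch only,
with hypothetical names (hc_worker, watchdog_tick, check_done, push_check); it
is not the contents of attachment 37621, which modifies mod_proxy_hcheck.c and
uses the APR/httpd types:

#include <stdint.h>

/* Hypothetical, simplified worker state; the real module keeps this in the
 * shared worker data and uses APR time types. */
typedef struct {
    int64_t updated_us;   /* time of the last completed check; 0 = in flight */
    int64_t interval_us;  /* hcinterval */
} hc_worker;

/* Watchdog side: skip a worker whose check is in flight, claim before queueing. */
static void watchdog_tick(hc_worker *w, int64_t now_us,
                          void (*push_check)(hc_worker *))
{
    if (w->updated_us == 0)                      /* a check is already running */
        return;
    if (now_us - w->updated_us < w->interval_us) /* not due yet */
        return;
    w->updated_us = 0;                           /* claim the slot */
    push_check(w);                               /* hand off to the thread pool */
}

/* Worker-thread side: re-stamp on completion to schedule the next check. */
static void check_done(hc_worker *w, int64_t now_us)
{
    w->updated_us = now_us;
}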

How does it work for you?



[Bug 63010] mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=63010

--- Comment #10 from Cristian Duca <du...@yahoo.com> ---
Hello, Yann,

I have tested the workaround of increasing ProxyHCTPSize.

Now I am trying to compile Apache on Windows to test the code changes from
comment 7 without any mitigations added to the Apache configuration. At the
moment I am not able to compile Apache on Windows. Do you have a wiki link
on how to compile the latest apache trunk source files using VS2017?

Thank you,
Cristian Duca.



[Bug 63010] mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=63010

--- Comment #8 from Cristian Duca <du...@yahoo.com> ---
Hello Yann,

I work with Tudor.
We already implemented that fix and it works as expected: Apache is stable
and we are not encountering any memory leak issues.
We have configured our hcinterval at 10 seconds (due to a LoadBalancer ping
test), hcpasses=2, and increased ProxyHCTPSize to 256 threads.
We wanted to know if/when this issue will be fixed in the code rather than
mitigated through configuration.

Thank you,
Cristian Duca.



[Bug 63010] mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=63010

--- Comment #2 from CFindley <ca...@gmail.com> ---
Note: we have a very lengthy and aged httpd.conf file which has cruft from
years of fixing defects and adding new options, so it is entirely possible this
is some artifact of that. I did try setting up a very simple configuration
which just health checks a single non-existent service, and it seems to be well
behaved. I can't share the config file as it contains sensitive (and
embarrassing) information unfortunately. :)



[Bug 63010] mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=63010

--- Comment #4 from WJCarpenter <bi...@carpenter.org> ---
I've spent quite a bit of time over the last few days trying to understand what
goes on here. I have reproduced the memory growth on both Windows and Linux
with Apache httpd 2.4.46. It doesn't depend on health check failures, though
they can affect the timing to make it more likely. (In my testing, I dummied
out the actual check in hc_check() and just put in a fixed delay to avoid
vagaries of network stuff.)

My conclusion is that it's strongly related to bug #62318
(https://bz.apache.org/bugzilla/show_bug.cgi?id=62318). The fix for that didn't
account for an important edge case. When should we mark a worker health check
as "updated", where "updated" means the last time it was checked? Originally,
the worker was marked as "updated" when the check completed, though the
timestamp used was when the check started. Any watchdog firings while the check
was still running would trigger additional checks to run for that worker since
the trigger condition was checking the "updated" timestamp that continued to
have an old value until that check completed.

The previous fix used the same timestamp for "updated", but it changed when the
timestamp was applied. It applied the timestamp at the beginning of the check.
That (mostly) prevents additional checks from being triggered while the first
one is running.

The edge case that isn't accounted for is when all of the threads in the pool
used for health checks are busy. In that case, the "updated" timestamp is not
applied until the health check task is popped from the queue and given to a
thread to run. Let's look at an example (which I actually used experimentally).
Suppose it takes 5 seconds to do a health check. That's a long time, but not
preposterous. The watchdog callback runs every 100ms, and the default thread
pool is 16 threads. With just a single worker to check, 49 additional checks
will be triggered before the first check finishes. Each will carry a timestamp
just 100ms later than the one just before it. The first 16 checks will be given
to threads for immediate execution, with the final one updating "updated" 1.5
seconds later than the first one. The other 34 or so will be queued until
threads free up to run them. Each of those queued health checks carries along
with it a timestamp to be applied to "updated", and the timestamp reflects the
time that the health check task was enqueued. There's just one "updated" field
for the worker, regardless of how many health check tasks there are for that
worker. The "updated" timestamp will always reflect the watchdog trigger time
of the health check task that started most recently. In our example, the
"updated" timestamp will eventually be marked with a time about 5 seconds later
than when the first check started.
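
As a much-simplified sketch of that flow in plain C (hypothetical names only,
not the actual mod_proxy_hcheck code): the single "updated" field is stamped
only when a pool thread finally starts the task, not when the watchdog
enqueues it, so while the pool is busy every watchdog firing adds another task
for the same worker, and each queued task holds memory.

#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified types for illustration only. */
typedef struct { int64_t updated_us, interval_us; } worker_t;
typedef struct { worker_t *w; int64_t stamp_us; } hc_task;

void enqueue(hc_task *t);           /* stand-in for apr_thread_pool_push()  */
void do_health_check(worker_t *w);  /* stand-in for the actual check        */

/* Watchdog callback, firing every ~100ms per the description above. */
void watchdog_tick(worker_t *w, int64_t now_us)
{
    if (now_us - w->updated_us >= w->interval_us) {
        hc_task *t = malloc(sizeof(*t));  /* each queued task holds memory   */
        t->w = w;
        t->stamp_us = now_us;             /* the time the task was enqueued  */
        enqueue(t);                       /* may wait behind busy threads    */
    }
}

/* Runs on a pool thread, possibly long after stamp_us. */
void run_check(hc_task *t)
{
    t->w->updated_us = t->stamp_us;       /* "updated" is applied only here  */
    do_health_check(t->w);
    free(t);
}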

With the default health check interval ("hcinterval") of 60 seconds, things
settle down until the 65 second point or so. Then the cycle repeats. There is
wasted effort of redundant health checks (which might foil expectations for
"hcfails" and "hcpasses"), and the "hcinterval" is effectively changed from 60
seconds to about 65 seconds. Those things are undesirable but not a tragedy.

Having just a single worker is unlikely. For every worker of some kind getting
health checks, the above scenario plays out. It's easy to get hundreds of
health check tasks queued up waiting for threads to process them. Still, the
net effect would be merely stretching the "hcinterval" time for each worker by
the time it takes to do a check. It's 5 seconds in our example, but it would be
way less for a healthy system. If the system is not healthy, the time it takes
to do a health check can be faster or slower. If we try to connect to a port with
no listener process, then we will very quickly get a "connection refused". If
an entire server is down, we won't get any response at all and will have to
wait for a self-imposed socket timeout (from the "connectiontimeout" parameter,
default 60s). Or, the process on the server might be in a weird state so that
the health check takes a long time to complete without causing a socket timeout
(from the "timeout" parameter, default 60s). If the time to complete the health
check exceeds the "hcinterval" for the worker, we will never catch up. A
growing number of health check tasks will be queued for that worker. Each
queued health check task is holding a chunk of heap memory, and that is what
accounts for the memory growth. (As one anecdotal data point, on Windows my
httpd normally runs at 15-16 MB process size. With my pathological test case,
it's easy for me to see it grow to a couple of GB over a couple of hours or so
with many tens of thousands of trapped health check tasks in the queue.)

So, what can be done to work around or fix this edge case?

* Workaround 1: Have at least as many threads in the thread pool
(ProxyHCTPSize) as you have distinct workers getting health checks. That way,
health check tasks will not usually get queued (though it still could happen
occasionally, so maybe allocate some extra threads). The threads will be idle
most of the time.

* Workaround 2: Make your health check intervals ("hcinterval") long enough to
avoid the pathological case. That is, health checks will always complete, pass
or fail, before the next check is due. Of course, you lose the benefit of more
frequent health checks. (A hypothetical config sketch for Workarounds 1 and 2
appears at the end of this comment.)

* Workaround 3: Don't use the thread pool for health checks, which you can do
by setting ProxyHCTPSize=0. That will make all health checks synchronous, which
is generally pretty undesirable.

* Fix 1: Apply the timestamp to "updated" before sending the health check task
to the pool. In mod_proxy_hcheck.c, that's in the function
hc_watchdog_callback() before the call to apr_thread_pool_push(). That will
keep most additional health check tasks from being created for the same worker,
though that can still happen if a health check takes longer than the health
check interval.

* Fix 2: Put in an explicit check that a worker has a health check in flight,
and don't create any new checks as long as that's true. This needs only a
single bit flag in the worker data blob that already exists. Set that flag on
the worker before you call apr_thread_pool_push() and clear the flag at the
completion of hc_check(). I don't see the value in having simultaneous
in-flight health checks for the same worker, but maybe someone else does.

If I were deciding, I would do both Fix 1 and Fix 2, though you could argue
that Fix 2 makes Fix 1 unnecessary (depending on how you feel about the meaning
of "hcinterval", as mentioned in the discussion of bug #62318). I have
prototyped both fixes to verify that they resolve the problem (or at least
make it much harder to trigger).
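
A purely hypothetical httpd.conf sketch for Workarounds 1 and 2 (the values
are invented; size them for your own number of checked workers and worst-case
check time):

...
# Workaround 1: give the health check pool at least as many threads as there
# are BalancerMembers carrying an hctemplate, plus some headroom.
ProxyHCTPSize 32
# Workaround 2: keep hcinterval comfortably larger than the worst-case check
# duration, which is bounded by the connectiontimeout/timeout parameters.
ProxyHCTemplate hcslow hcmethod=GET hcuri=/favicon.ico hcinterval=120
...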



[Bug 63010] mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=63010

--- Comment #5 from Jim Jagielski <ji...@apache.org> ---
Thanks for the incredible bug report and the excellent investigation. I agree
that Fixes 1+2 make the most sense. I'll look into adding them asap.



[Bug 63010] mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=63010

Yann Ylavic <yl...@gmail.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #12 from Yann Ylavic <yl...@gmail.com> ---
Fixed in 2.4.x (r1887119), will be in the upcoming release.



[Bug 63010] mod_proxy_hcheck when health checks configured, esp. vs down service, consumes large memory

Posted by bu...@apache.org.
https://bz.apache.org/bugzilla/show_bug.cgi?id=63010

--- Comment #1 from CFindley <ca...@gmail.com> ---
Created attachment 36324
  --> https://bz.apache.org/bugzilla/attachment.cgi?id=36324&action=edit
error.log output covering 2-tenths of a second/5k lines
