Posted to dev@logging.apache.org by Denys Smolianiuk <de...@harmonicinc.com> on 2018/08/16 15:31:14 UTC

[log4cxx] Logging in Timing-Critical Applications

Hello All,

I'd like to share some experience, as well as some patches, with regard
to using log4cxx in a timing-critical application. First, a few words
about our requirements: it is a service which must generate network
packets with timing precision on the order of a hundred microseconds.
Thus, it is very important to have predictable code timing. One can
argue that log4cxx is not well suited for such applications, but
surprisingly it works pretty well after some light tuning.

So, what were the issues?
From the library user's point of view they all looked the same: all of
a sudden, logging done with the LOG4CXX_DEBUG() macro could take an
unexpectedly long time to complete. For example, the same trace
statement which takes several microseconds 99% of the time would
sometimes take hundreds of microseconds or even a few milliseconds.
Further investigation traced this down to a few root causes:

1. The async logger (which we have been using, of course) has an
internal queue to pass log entries to the background disk-writer
thread. This queue is mutex-protected, which might seem fine until you
think about it a little more. First of all, someone calling
LOG4CXX_DEBUG() simply to put something into the log might not expect
to be blocked waiting for a mutex at all. Second, although measures
were taken to minimize the time the disk thread holds that lock, OS
schedulers often de-schedule a thread that blocks on a mutex. With a
typical scheduler quantum, that means the logging thread can be
preempted for milliseconds.

2. There are mutexes protecting the internal state of both loggers and
appenders. This means that two separate threads calling
LOG4CXX_DEBUG() can block each other. Even if they use different
loggers, they will block on the appender! This has the same
consequences for execution timing and performance as described above.

3. The std::stringstream constructor takes some internal locks of its
own. Unfortunately, each MessageBuffer has its own instance of this
class, and MessageBuffer is created inside the LOG4CXX_DEBUG() macro.
There is an optimization that skips the stringstream when logging a
simple string, but as soon as your log statement contains a single
'<<' operator it is created (see the example after this list).

4. Dynamic memory allocations. Unfortunately there are still quite a
few of them, even though a memory pool is used in some other places.
These hidden calls to new and malloc introduce unpredictable delays.
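
To illustrate point 3, a small example (logger, seq and elapsed_us are
hypothetical names); only the second statement forces the MessageBuffer
to construct a std::stringstream:

    // Plain string literal: the simple-string optimization applies,
    // no std::stringstream is constructed.
    LOG4CXX_DEBUG(logger, "connection established");

    // A single '<<' in the statement makes the MessageBuffer
    // construct a std::stringstream behind the scenes.
    LOG4CXX_DEBUG(logger, "sent packet seq=" << seq << " after " << elapsed_us << " us");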

So, what did we do to mitigate these problems?

1. The natural solution for this issue was a lock-free queue. There
are a few of them available; we chose boost::lockfree::queue because
it can serve as a drop-in replacement, allowing us to keep all present
functionality (a sketch follows after this list).

2. After looking further into the code, it turned out that two
concurrent calls to LOG4CXX_DEBUG() from different threads are not
harmful, because the internal structures of the logger and appender
are not modified there. The only thing that really requires protection
is concurrency between logging and configuration. Thus, we settled on
read-write locks, where logging calls act as readers and
configuration/shutdown calls act as writers. With this approach,
multiple threads calling LOG4CXX_DEBUG() became free of any contention
(second sketch after this list).

3. This problem also has a simple solution: one static
std::stringstream object per thread, declared thread_local.
Unfortunately we found one drawback: thread_local memory is not
released if the thread is neither detached nor joined. As there is
some code which does neither, we made the static stringstream an XML
configuration option. There could also be an issue with using multiple
MessageBuffer instances from within a single thread, but
LOG4CXX_DEBUG() does not do that (third sketch after this list).

4. At this time we did not do anything to address the dynamic memory
allocation issue.
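
A minimal sketch of the lock-free queue idea from point 1 above. This
is not the patch code: LogEntry is a stand-in for whatever the async
appender actually enqueues, and raw pointers are queued because
boost::lockfree::queue requires trivially copyable elements.

    #include <boost/lockfree/queue.hpp>

    struct LogEntry;  // stand-in for the queued event type

    // Fixed-capacity, lock-free multi-producer/multi-consumer queue.
    boost::lockfree::queue<LogEntry*> pending(1024);

    // Called from the logging thread: never blocks on a mutex.
    void enqueue(LogEntry* entry)
    {
        while (!pending.push(entry))
        {
            // Queue full: spin, drop the entry, or fall back to a
            // synchronous write.
        }
    }

    // Called from the background disk-writer thread.
    void drain()
    {
        LogEntry* entry = nullptr;
        while (pending.pop(entry))
        {
            // format the entry, write it to disk, then release it
        }
    }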
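
For point 2, the patch introduces SHARED_MUTEX and LOCK_R/LOCK_W
macros; the following is only a rough illustration of the same
reader/writer split using std::shared_mutex (C++17), not the patch
itself:

    #include <shared_mutex>
    #include <mutex>

    class Appender
    {
        std::shared_mutex state_mutex;  // analogue of the patch's SHARED_MUTEX

    public:
        void append(/* const LoggingEvent& event */)
        {
            // LOCK_R: many logging threads may hold this concurrently,
            // because append() only reads the appender's configuration.
            std::shared_lock<std::shared_mutex> lock(state_mutex);
            // format the event and hand it to the writer
        }

        void reconfigure(/* new layout, target file, ... */)
        {
            // LOCK_W: configuration and shutdown take the lock exclusively.
            std::unique_lock<std::shared_mutex> lock(state_mutex);
            // mutate the appender's internal state
        }
    };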
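
And for point 3, a sketch of the per-thread stream; format_message is
a hypothetical helper, not library code. Resetting both the contents
and the error flags between uses is the part that is easy to forget.

    #include <sstream>
    #include <string>

    // One stream per thread, constructed once and then reused, so the
    // stringstream-constructor locks are paid only on first use per thread.
    std::string format_message(int value)
    {
        thread_local std::ostringstream stream;
        stream.str("");   // drop the previous contents
        stream.clear();   // reset error/EOF flags
        stream << "value=" << value;
        return stream.str();
    }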

So, if you want to give our patch a try, it is attached. It is based
on log4cxx 0.10.0-12 as found in Debian. There are new SHARED_MUTEX
and LOCK_R/LOCK_W macros defined in mutex.h and synchronized.h for
easy switching between read-write and conventional mutexes for
benchmarking. There is also a test application which spawns two
threads that log with a small sleep between iterations; it prints to
stdout whenever a trace statement takes more than 500 microseconds. It
might look familiar to you because it is based on one of the examples
from StackOverflow.
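
The attached test program is not reproduced here, but a sketch of the
same idea (two threads, a short sleep per iteration, and a report
whenever a single log statement exceeds 500 microseconds) could look
like this; the worker function and the "bench" logger name are
placeholders:

    #include <log4cxx/logger.h>
    #include <log4cxx/basicconfigurator.h>
    #include <chrono>
    #include <iostream>
    #include <thread>

    void worker(log4cxx::LoggerPtr logger)
    {
        using namespace std::chrono;
        for (int i = 0; i < 100000; ++i)
        {
            auto start = steady_clock::now();
            LOG4CXX_DEBUG(logger, "iteration " << i);
            auto elapsed = duration_cast<microseconds>(steady_clock::now() - start);
            if (elapsed.count() > 500)
                std::cout << "slow log statement: " << elapsed.count() << " us\n";
            std::this_thread::sleep_for(microseconds(100));
        }
    }

    int main()
    {
        log4cxx::BasicConfigurator::configure();  // a real run would load the XML config instead
        auto logger = log4cxx::Logger::getLogger("bench");
        std::thread t1(worker, logger), t2(worker, logger);
        t1.join();
        t2.join();
    }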

In our testing the modified log4cxx has about 20% better performance
overall, but, more importantly for us, it has far fewer cases where a
log statement takes excessive time to complete. The second part only
holds for CPUs with more than two cores, where all threads can
physically run in parallel. Unfortunately we still see rare cases of
prolonged logging calls; we have traced those down to the dynamic
memory allocation issue.

Any thoughts?

Denys Smolianiuk

Senior SW Engineer
Harmonic Inc.



Re: [log4cxx] Logging in Timing-Critical Applications

Posted by Remko Popma <re...@gmail.com>.
All these sound like very nice improvements!


Re: [log4cxx] Logging in Timing-Critical Applications

Posted by Robert Middleton <os...@gmail.com>.
Interesting results!  I haven't experienced any problems with logging
myself, but I've also never benchmarked it.

I don't have much experience with memory pools, but if you're still
seeing blockages on calls to new/malloc, you may be able to get
around them by using the memory pool and then calling the constructor
in place (I have never done this before, but I know it is possible to
do) [1][2].  Or there may be some way to do it using a custom
allocator, though that's something I have never done either.

-Robert Middleton

[1]: https://stackoverflow.com/questions/519808/call-a-constructor-on-a-already-allocated-memory
[2]: https://stackoverflow.com/questions/222557/what-uses-are-there-for-placement-new
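
Not the actual pool code, but a tiny sketch of the placement-new idea
from [1] and [2], using a single pre-allocated slot as a stand-in for
a real memory pool; Event, make_event and recycle_event are
hypothetical names:

    #include <new>      // placement new
    #include <cstddef>

    struct Event
    {
        int level;
        const char* msg;
        Event(int l, const char* m) : level(l), msg(m) {}
    };

    // Storage obtained up front (here a static buffer; in practice, a pool block).
    alignas(Event) static unsigned char slot[sizeof(Event)];

    Event* make_event(int level, const char* msg)
    {
        // Construct the object in already-allocated memory: no new/malloc at log time.
        return new (slot) Event(level, msg);
    }

    void recycle_event(Event* e)
    {
        // Destroy explicitly; the storage itself is reused, not freed.
        e->~Event();
    }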


Re: [log4cxx] Logging in Timing-Critical Applications

Posted by Matt Sicker <bo...@gmail.com>.
I don't know much about the state of the log4cxx architecture, but
nearly all of your points (other than the lock in stringstream) are
points we optimize for in log4j2, at least. Even the stringstream
optimization sounds similar to StringBuffer versus StringBuilder in
Java. As for the queue used in async logging, I'm not sure what
guarantees you get from the C++ memory model, but I'm curious whether
the Disruptor queue pattern has made its way over to Boost?


-- 
Matt Sicker <bo...@gmail.com>

Re: [log4cxx] Logging in Timing-Critical Applications

Posted by Thorsten Schöning <ts...@am-soft.de>.
Hello Denys Smolianiuk,
on Thursday, 16 August 2018 at 17:31, you wrote:

> Any thoughts?

Thanks for sharing your research, I've additionally created a ticket
to track your patches:

https://issues.apache.org/jira/browse/LOGCXX-500

Kind regards,

Thorsten Schöning

-- 
Thorsten Schöning       E-Mail: Thorsten.Schoening@AM-SoFT.de
AM-SoFT IT-Systeme      http://www.AM-SoFT.de/

Telefon...........05151-  9468- 55
Fax...............05151-  9468- 88
Mobil..............0178-8 9468- 04

AM-SoFT GmbH IT-Systeme, Brandenburger Str. 7c, 31789 Hameln
AG Hannover HRB 207 694 - Geschäftsführer: Andreas Muchow