You are viewing a plain text version of this content. The canonical link for it is here.
Posted to notifications@logging.apache.org by "xendo (via GitHub)" <gi...@apache.org> on 2023/04/06 11:33:50 UTC
[PR] Add discard buffer to prevent unsychronized access when RingBuffer full (logging-log4j2)
xendo opened a new pull request, #1410:
URL: https://github.com/apache/logging-log4j2/pull/1410
A service that we maintain occasionally hangs when the rate of incoming logs is higher than the throughput of writing logs to a disk. We have a discard policy at ERROR level and use default `SynchronizeEnqueueWhenQueueFull`. We believe that this is due to corrupted LMAX disruptor. We did some initial attempts to produce minimal reproducer (using jcstress) but were not able to succeed so far.
While reviewing the log4j code we noticed that there are code paths that avoid SynchronizeEnqueueWhenQueueFull and may be responsible for RingBuffer corruption, I prepared a small change that should prevent any unsynchronized access and wanted to get initial feedback from log4j maintainers. I also wanted to use this pull request to pick you brain on what else we can do to troubleshoot the issue futher.
Writer threads hang on following stack trace:
```
java.lang.Thread.State: TIMED_WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)
at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:136)
at com.lmax.disruptor.MultiProducerSequencer.next(MultiProducerSequencer.java:105)
at com.lmax.disruptor.RingBuffer.publishEvent(RingBuffer.java:465)
at com.lmax.disruptor.dsl.Disruptor.publishEvent(Disruptor.java:326)
at org.apache.logging.log4j.core.async.AsyncLoggerDisruptor.enqueueLogMessageWhenQueueFull(AsyncLoggerDisruptor.java:236)
- locked <0x000000052818a2c0> (a java.lang.Object)
at org.apache.logging.log4j.core.async.AsyncLogger.handleRingBufferFull(AsyncLogger.java:246)
at org.apache.logging.log4j.core.async.AsyncLogger.publish(AsyncLogger.java:230)
at org.apache.logging.log4j.core.async.AsyncLogger.logWithThreadLocalTranslator(AsyncLogger.java:225)
at org.apache.logging.log4j.core.async.AsyncLogger.access$000(AsyncLogger.java:67)
at org.apache.logging.log4j.core.async.AsyncLogger$1.log(AsyncLogger.java:152)
at org.apache.logging.log4j.core.async.AsyncLogger.log(AsyncLogger.java:136)
at org.apache.logging.log4j.spi.AbstractLogger.tryLogMessage(AbstractLogger.java:2205)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageTrackRecursion(AbstractLogger.java:2159)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2142)
at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:2022)
at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1875)
at org.apache.logging.slf4j.Log4jLogger.error(Log4jLogger.java:299)
```
While the event processor is runnable, but is not making any progress, i.e. no logs are produced
```
java.lang.Thread.State: RUNNABLE
at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:159)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:125)
at java.lang.Thread.run(Thread.java:750)
```
Benchmark results
---
I was a bit surprised, but comparing to previous results of those benchmarks, I didn't see a significant drop in performance for the `ENQUEUE_UNSYNCHRONIZED` . Benchmarks were run on m5.4xl instance (16 vCPU)
```
Benchmark (asyncLoggerType) (queueFullPolicy) Mode Cnt Score Error Units
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads ASYNC_CONTEXT ENQUEUE thrpt 3 1473717.926 ± 910941.806 ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads ASYNC_CONTEXT ENQUEUE_UNSYNCHRONIZED thrpt 3 1265437.867 ± 184317.376 ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads ASYNC_CONTEXT SYNCHRONOUS thrpt 3 1726704.651 ± 579862.302 ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads ASYNC_CONTEXT ENQUEUE_WITH_DISCARD_BUFFER thrpt 3 1483961.304 ± 536377.123 ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads ASYNC_CONTEXT ENQUEUE_UNSYNCHRONIZED_WITH_DISCARD_BUFFER thrpt 3 1270870.061 ± 749036.480 ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads ASYNC_CONFIG ENQUEUE thrpt 3 1327449.173 ± 295188.828 ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads ASYNC_CONFIG ENQUEUE_UNSYNCHRONIZED thrpt 3 1113508.048 ± 127087.930 ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads ASYNC_CONFIG SYNCHRONOUS thrpt 3 1762200.783 ± 491492.622 ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads ASYNC_CONFIG ENQUEUE_WITH_DISCARD_BUFFER thrpt 3 1332089.400 ± 1275770.505 ops/s
ConcurrentAsyncLoggerToFileBenchmark.concurrentLoggingThreads ASYNC_CONFIG ENQUEUE_UNSYNCHRONIZED_WITH_DISCARD_BUFFER thrpt 3 1126223.983 ± 464606.799 ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread ASYNC_CONTEXT ENQUEUE thrpt 3 1466317.665 ± 522556.954 ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread ASYNC_CONTEXT ENQUEUE_UNSYNCHRONIZED thrpt 3 1477131.094 ± 371409.396 ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread ASYNC_CONTEXT SYNCHRONOUS thrpt 3 1429398.368 ± 426180.449 ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread ASYNC_CONTEXT ENQUEUE_WITH_DISCARD_BUFFER thrpt 3 1459565.150 ± 1185274.825 ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread ASYNC_CONTEXT ENQUEUE_UNSYNCHRONIZED_WITH_DISCARD_BUFFER thrpt 3 1502960.667 ± 380188.217 ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread ASYNC_CONFIG ENQUEUE thrpt 3 1579091.646 ± 623213.747 ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread ASYNC_CONFIG ENQUEUE_UNSYNCHRONIZED thrpt 3 1480321.360 ± 524851.514 ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread ASYNC_CONFIG SYNCHRONOUS thrpt 3 1656552.128 ± 543691.433 ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread ASYNC_CONFIG ENQUEUE_WITH_DISCARD_BUFFER thrpt 3 1580294.283 ± 436382.888 ops/s
ConcurrentAsyncLoggerToFileBenchmark.singleLoggingThread ASYNC_CONFIG ENQUEUE_UNSYNCHRONIZED_WITH_DISCARD_BUFFER thrpt 3 1550912.637 ± 148770.858 ops/s
```
Checklist
---
* Base your changes on `2.x` branch if you are targeting Log4j 2; use `main` otherwise
DONE
* `./mvnw verify` succeeds (if it fails due to code formatting issues reported by Spotless, simply run `spotless:apply` and retry)
DONE
* Changes contain an entry file in the `src/changelog/.2.x.x` directory
TBD
* Tests for the changes are provided
DONE
* [Commits are signed](https://docs.github.com/en/authentication/managing-commit-signature-verification/signing-commits) (optional, but highly recommended)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: notifications-unsubscribe@logging.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [PR] Add discard buffer to prevent unsychronized access when RingBuffer full (logging-log4j2)
Posted by "xendo (via GitHub)" <gi...@apache.org>.
xendo commented on PR #1410:
URL: https://github.com/apache/logging-log4j2/pull/1410#issuecomment-1500188250
We can reproduce this pretty reliably. There may be a secret ingredient in our setup (we use custom LogEventPatternConverters), but as I said before, we were not able to produce minimal reproducer that I could share. We initially confirmed that additional discard buffer proposed here mitigates the issue.
We looked into SYNCHRONIZED policy, but the problem we have is that it can make the performance worse when the traffic is the highest. The service we own is latency sensitive and the tradeoff we want to make is to discard the logs if we can't keep up. As far as I understand the DISCARD and SYNCHRONIZED policies are mutually exclusive. SYNCHRONIZED_DISCARD policy may be another way to solve this.
> When the buffer is full, each time an event is processed, the background thread will notify all waiting threads.
Yes, that was also something I noticed while running benchmarks, although the impact was not nearly as big as it was when you [initially worked on this](https://github.com/apache/logging-log4j2/pull/273) .
> The ringbuffer we use is configured for multi-producer-single-consumer, where concurrent requests to add data should be safe.
From what I can tell it not always is. Ideally that's what should be fixed but, to be honest, I don't know how.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: notifications-unsubscribe@logging.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
Re: [PR] Add discard buffer to prevent unsychronized access when RingBuffer full (logging-log4j2)
Posted by "carterkozak (via GitHub)" <gi...@apache.org>.
carterkozak commented on PR #1410:
URL: https://github.com/apache/logging-log4j2/pull/1410#issuecomment-1499042526
> we noticed that there are code paths that avoid SynchronizeEnqueueWhenQueueFull and may be responsible for RingBuffer corruption
The ringbuffer we use is configured for multi-producer-single-consumer, where concurrent requests to add data should be safe. We added the synchronization for a slightly different reason: When the buffer is full, each time an event is processed, the background thread will notify _all_ waiting threads. In a large web-server, this can be a lot of threads, and the system degrades into a state where most cpu time is spent notifying threads. The synchronization allows us to queue waiting threads efficiently when the buffer is full.
What is the rate at which the corrupted state reproduces in your environment? I ran into a similar issue at one point which reproduced once every few months, using the SYNCHRONOUS queue-full-policy (writing logevents from the current thread rather than blocking waiting to enqueue) for non-discarded events seemed to resolve it (if it does occur, data would be logged synchronously, but my metrics indicate that the queue is never entirely filled either).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: notifications-unsubscribe@logging.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org