Posted to commits@cassandra.apache.org by "Sumanth Pasupuleti (JIRA)" <ji...@apache.org> on 2018/12/06 16:47:00 UTC

[jira] [Commented] (CASSANDRA-14855) Message Flusher scheduling fell off the event loop, resulting in out of memory

    [ https://issues.apache.org/jira/browse/CASSANDRA-14855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16711679#comment-16711679 ] 

Sumanth Pasupuleti commented on CASSANDRA-14855:
------------------------------------------------

I agree with you, [~benedict]. Thanks for committing. I was going to update this JIRA with yet another finding from production a few days ago (I am still collecting the relevant heap dump screenshots, etc.): on the same cluster, CPU got hogged by "something", and messages were not getting flushed from ImmediateFlusher either (my theory is lack of CPU), resulting in OOM. As you suggested earlier in this JIRA, I will look into adding a bound to the queue of items waiting to be flushed; since that applies to trunk as well, I will spin off a separate JIRA for it.
In parallel, I am looking into automating the capture of flame graphs during such incidents, so that we have leads on what is actually hogging the CPU.

Let me know if you have any thoughts/suggestions around this.

> Message Flusher scheduling fell off the event loop, resulting in out of memory
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-14855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14855
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>            Reporter: Sumanth Pasupuleti
>            Assignee: Sumanth Pasupuleti
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.18
>
>         Attachments: blocked_thread_pool.png, cpu.png, eventloop_scheduledtasks.png, flusher running state.png, heap.png, heap_dump.png, read_latency.png
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> We recently had a production issue where about 10 nodes in a 96-node cluster ran out of heap.
> From heap dump analysis, I believe there is enough evidence to indicate that the "queued" data member of the Flusher grew too large, resulting in the out-of-memory condition.
> Below are specifics on what we found from the heap dump (relevant screenshots attached):
> * a non-empty "queued" data member of a Flusher with a retained heap of 0.5GB, and multiple such instances
> * the "running" data member of the Flusher set to "true"
> * scheduledTasks on the event loop with a size of 0
> We suspect that something (maybe an exception) caused the Flusher's running state to remain "true" even though the Flusher was no longer able to schedule itself on the event loop.
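> To illustrate the suspected failure mode, here is a simplified sketch (class and field names are condensed for illustration, and a ScheduledExecutorService stands in for Netty's event loop; this is not the actual Flusher source) of how gating scheduling on a running flag can wedge flushing permanently if the task dies before re-arming:
> {code:java}
> import java.util.Queue;
> import java.util.concurrent.ConcurrentLinkedQueue;
> import java.util.concurrent.ScheduledExecutorService;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.atomic.AtomicBoolean;
>
> // Simplified sketch of the suspected failure mode, not the real Flusher.
> class FlusherSketch implements Runnable
> {
>     private final Queue<Object> queued = new ConcurrentLinkedQueue<>(); // unbounded
>     private final AtomicBoolean running = new AtomicBoolean(false);
>     private final ScheduledExecutorService eventLoop; // stand-in for Netty's EventLoop
>
>     FlusherSketch(ScheduledExecutorService eventLoop) { this.eventLoop = eventLoop; }
>
>     void enqueue(Object flushItem)
>     {
>         queued.add(flushItem);
>         // Only the producer that flips "running" arms the flush task; all
>         // others assume an already-running task will drain the queue.
>         if (running.compareAndSet(false, true))
>             eventLoop.schedule(this, 10, TimeUnit.MICROSECONDS);
>     }
>
>     public void run()
>     {
>         // ... drain "queued" and flush the items to their channels ...
>         // If an exception escaped here, or the reschedule below never
>         // happened, "running" would stay true forever: enqueue() would
>         // never arm a new task, and "queued" would grow without bound,
>         // consistent with the heap-dump observations above.
>         if (queued.isEmpty())
>             running.set(false);
>         else
>             eventLoop.schedule(this, 10, TimeUnit.MICROSECONDS);
>     }
> }
> {code}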
> We could not find any ERROR entries in system.log; the only notable entries around the incident time were the following INFO logs.
> {code:java}
> INFO [epollEventLoopGroup-2-4] 2018-xx-xx xx:xx:xx,592 Message.java:619 - Unexpected exception during request; channel = [id: 0x8d288811, L:/xxx.xx.xxx.xxx:7104 - R:/xxx.xx.x.xx:18886]
> io.netty.channel.unix.Errors$NativeIoException: readAddress() failed: Connection timed out
>  at io.netty.channel.unix.Errors.newIOException(Errors.java:117) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.unix.Errors.ioResult(Errors.java:138) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:175) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:238) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:926) ~[netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:397) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:302) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131) [netty-all-4.0.44.Final.jar:4.0.44.Final]
>  at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144) [netty-all-4.0.44.Final.jar:4.0.44.Final]
> {code}
> I would like to pursue the following proposals to fix this issue:
> # ImmediateFlusher: Backport trunk's ImmediateFlusher ([CASSANDRA-13651|https://issues.apache.org/jira/browse/CASSANDRA-13651], https://github.com/apache/cassandra/commit/96ef514917e5a4829dbe864104dbc08a7d0e0cec) to 3.0.x, and possibly to other versions as well, since ImmediateFlusher appears more robust than the existing Flusher: it does not depend on any running state or self-scheduling.
> # Make "queued" data member of the Flusher bounded to avoid any potential of causing out of memory due to otherwise unbounded nature.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org