You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Benedict (JIRA)" <ji...@apache.org> on 2019/05/15 11:29:00 UTC

[jira] [Comment Edited] (CASSANDRA-15013) Message Flusher queue can grow unbounded, potentially running JVM out of memory

    [ https://issues.apache.org/jira/browse/CASSANDRA-15013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840301#comment-16840301 ] 

Benedict edited comment on CASSANDRA-15013 at 5/15/19 11:28 AM:
----------------------------------------------------------------

I've pushed some minor suggestions [here|https://github.com/belliottsmith/cassandra/tree/15013-suggestions] around naming:

# Tried to make the native_transport config parameters have more consistent naming with prior parameters - feel free to modify them further, if you think you can improve them still
# {{forceAllocate}} -> {{allocate}}, which is usually the alternative to {{tryAllocate}}
# Shortened THROW_ON_OVERLOAD parameter

There are three remaining bugs, and I've paused review until they can be addressed:

# {{this::releaseItem}} is unsafe to provide to the {{Flusher}} constructor, since these are unique to a channel, and the {{Flusher}} is per-eventLoop.  If we choose to hash all connections on an endpoint to a single eventLoop this would be easy to accommodate, or otherwise {{FlushItem}} needs to be the implementor of {{release()}}
# I don't think we can use {{Ref}} for management of the {{EndpointPayloadTracker}}.  The {{Tidy}} implementation requires a reference to the object itself, and anyway logically deleting itself after release defeats the point of Ref (which is leak detection).  It's impossible for it to detect a leak and cleanup, if the strong reference is cleaned up by this process (since there will always be a strong reference until it invokes, and it requires there to be no strong references, it will never invoke).  Probably we should use a simple AtomicInteger to manage reference counts.  I think it would be cleanest to encapsulate the map management inside a static method in {{EndpointPayloadTracker}} as well.
# I think we currently have a race condition around the release of a channel (and its {{EndpointPayloadTracker}}) and the attempt to release capacity from the {{EndpointPayloadTracker}} we have requests in flight for.  Channels can be invalidated before we complete requests issued by them, so we must be sure to release from the tracker we allocated from, so that we do not wrap into negative on release.



was (Author: benedict):
I've pushed some minor suggestions [here|https://github.com/belliottsmith/cassandra/tree/15013-suggestions] around naming:

# Tried to make the native_transport config parameters have more consistent naming with prior parameters - feel free to modify them further, if you think you can improve them still
# {{forceAllocate}} -> {{allocate}}, which is usually the alternative to {{tryAllocate}}
# Shortened THROW_ON_OVERLOAD parameter

There are three remaining bugs, and I've paused review until they can be addressed:

# {{this::releaseItem}} is unsafe to provide to the {{Flusher}} constructor, since these are unique to an address, and the {{Flusher}} is per-eventLoop.  If we choose to hash all connections on an endpoint to a single eventLoop this would be easy to accommodate, or otherwise {{FlushItem}} needs to be the implementor of {{release()}}
# I don't think we can use {{Ref}} for management of the {{EndpointPayloadTracker}}.  The {{Tidy}} implementation requires a reference to the object itself, and anyway logically deleting itself after release defeats the point of Ref (which is leak detection).  It's impossible for it to detect a leak and cleanup, if the strong reference is cleaned up by this process (since there will always be a strong reference until it invokes, and it requires there to be no strong references, it will never invoke).  Probably we should use a simple AtomicInteger to manage reference counts.  I think it would be cleanest to encapsulate the map management inside a static method in {{EndpointPayloadTracker}} as well.
# I think we currently have a race condition around the release of a channel (and its {{EndpointPayloadTracker}}) and the attempt to release capacity from the {{EndpointPayloadTracker}} we have requests in flight for.  Channels can be invalidated before we complete requests issued by them, so we must be sure to release from the tracker we allocated from, so that we do not wrap into negative on release.


> Message Flusher queue can grow unbounded, potentially running JVM out of memory
> -------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15013
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15013
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Messaging/Client
>            Reporter: Sumanth Pasupuleti
>            Assignee: Sumanth Pasupuleti
>            Priority: Normal
>              Labels: pull-request-available
>             Fix For: 4.0, 3.0.x, 3.11.x
>
>         Attachments: BlockedEpollEventLoopFromHeapDump.png, BlockedEpollEventLoopFromThreadDump.png, RequestExecutorQueueFull.png, heap dump showing each ImmediateFlusher taking upto 600MB.png
>
>
> This is a follow-up ticket out of CASSANDRA-14855, to make the Flusher queue bounded, since, in the current state, items get added to the queue without any checks on queue size, nor with any checks on netty outbound buffer to check the isWritable state.
> We are seeing this issue hit our production 3.0 clusters quite often.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@cassandra.apache.org
For additional commands, e-mail: commits-help@cassandra.apache.org