You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@cassandra.apache.org by "Benedict (JIRA)" <ji...@apache.org> on 2015/06/02 17:45:31 UTC

[jira] [Commented] (CASSANDRA-8897) Remove FileCacheService, instead pooling the buffers

    [ https://issues.apache.org/jira/browse/CASSANDRA-8897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569296#comment-14569296 ] 

Benedict commented on CASSANDRA-8897:
-------------------------------------

Thanks [~stef1927]!

I've pushed one last round of suggestions. If you can yay/nay them, and make any comments you feel necessary, I'll then see about committing. They're more involved than I'd first intended, so please don't shy away from any criticism you consider warranted.

To summarise, as far as basic functionality is concerned, I've:

# modified the chunk queue logic to always be a prefix of non-null items and to always put the new chunk at the end, so it is always tried last
# implemented the immediate freeing of a chunk that is owned by the thread that is freeing (to keep retained memory to a minimum)
# merged the CAS to perform the recycle with the update that sets it to fully free
# some minor renaming/refactoring, such as separating the adoption of a new Chunk from the allocation of a buffer from it

Then, I've fixed a bug. This bug crept in due to the mix of suggestions I made to you, and how they combined in this instance. The bug is a pretty subtle and rare concurrency failure, wherein a thread could be late recycling a buffer that has already been recycled and is concurrently being re-adopted by another thread, immediately making it unavailable for allocation by that other thread. In most situations this would end up simply occupying some space in the owner's queue until it got evicted, but it would be possible for it to be reallocated to another thread while still avaiable and owned by the first, and trigger assertion failures.

In order to fix this bug, I've rolled back the reuse of the same Chunk instance; there are other solutions, but they are likely not worth the trouble.

In order to _exercise_ this bug, I've revamped the long test. The new version attempts to increase the level of direct interleaving of concurrent operations. Sleeps in particular mitigate this kind of behaviour, so they are avoided. Occasionally we yield, if no safe progress can be made by a test thread. There are fewer threads, and these each reduce the concurrent operations they participate in that are not directly incurred in the BufferPool. To summarise these changes:

# Utilises DynamicList from stress, that is moved into the main package (and split into LockedDynamicList and DynamicList), and permits us to maintain a per-thread collection of buffers to validate
# Constrains the buffer pool's memory so that it can see high churn and reuse of the same chunks
# Ensures the total amount of memory in use is at or around the size of the buffer pool
# Ensures that all chunks are recycled periodically (so that none are leaked/lost)
#* This involved introducing some debug logic into the BufferPool path, to track the recycling of chunks; this isn't super pretty, but is very helpful to have confidence of the behaviour
# Randomly frees buffers from its local collection, either itself or via a neighbour
# Each thread only frees buffers from one of its neighbours, so that the message passing between them can be kept extremely cheap (reducing cost for none BufferPool operations)
# Amortizes the cost of System.nanoTime(), reducing competition within it (again, non-BufferPool concurrent op)
# Introduces two new threads, whose sole job is to churn chunks at a high rate. This is to most directly increase the incidence of the bug.
# Randomly burst allocates the entire pool contents

I've also slightly modified your version, and uploaded it [here|https://github.com/belliottsmith/cassandra/tree/8897-testimprovements] with the new long test. This change was just to further increase the likelihood of exercising this bug, by hitting a failure-to-allocate in the thread that has its freeSlots ripped away from it.

> Remove FileCacheService, instead pooling the buffers
> ----------------------------------------------------
>
>                 Key: CASSANDRA-8897
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8897
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Benedict
>            Assignee: Stefania
>             Fix For: 3.x
>
>         Attachments: 8897.yaml, 9240_test_results.txt, snapshot-1431582436640-cpu-backtraces.png, snapshot-1431582436640-cpu-calltree-compression-8897.nps, snapshot-1431582436640-cpu-calltree-compression-trunk.nps
>
>
> After CASSANDRA-8893, a RAR will be a very lightweight object and will not need caching, so we can eliminate this cache entirely. Instead we should have a pool of buffers that are page-aligned.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)