Posted to commits@cassandra.apache.org by "Branimir Lambov (JIRA)" <ji...@apache.org> on 2014/11/19 17:17:34 UTC

[jira] [Commented] (CASSANDRA-7075) Add the ability to automatically distribute your commitlogs across all data volumes

    [ https://issues.apache.org/jira/browse/CASSANDRA-7075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218069#comment-14218069 ] 

Branimir Lambov commented on CASSANDRA-7075:
--------------------------------------------

First draft of the multi-volume commit log can be found [here|https://github.com/blambov/cassandra/compare/7075-commitlog-volumes-2]. This is still a work in progress; while I look at ways to test everything properly, I'd be interested in some opinions on where to take this next.

To spread the load between drives, the new implementation switches 'volumes' on every sync request. Each volume has its own writing thread (which, in the compressed case, also does the compression); the segment management thread, which handles creating and recycling segments, remains shared for now. Each volume writes to its own CommitLogSegment, so we may write some mutations in one segment, switch to the segment on another drive, then switch back to the first, which means the order of mutations is no longer defined primarily by the segment ID.

To deal with this I exposed the concept of a 'section', which existed before as the set of mutations between two sync markers, and gave each section an ID which now replaces the segment ID in ReplayPositions. Every time we start writing to a volume, a new section with a fresh ID is created. Every time we switch volumes, a write for the old section is scheduled, and either the volume is put back at the end of a queue of ready-to-use volumes (if the segment is not exhausted or a reserve segment is available) or the management thread is woken to prepare a new segment and put the volume back in the queue when one is ready.
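
To make the ordering concrete, here is a minimal sketch of a replay position keyed on the section ID; the class and field names are mine for illustration, not the patch's:

{code:java}
// Hypothetical sketch: a replay position ordered by section ID instead of
// segment ID. Section IDs increase monotonically across all volumes, so
// comparing positions stays well-defined even when writes alternate
// between segments on different drives.
public final class SectionReplayPosition implements Comparable<SectionReplayPosition>
{
    public final long sectionId; // fresh ID assigned each time a volume starts a new section
    public final int position;   // byte offset within the section

    public SectionReplayPosition(long sectionId, int position)
    {
        this.sectionId = sectionId;
        this.position = position;
    }

    @Override
    public int compareTo(SectionReplayPosition other)
    {
        int bySection = Long.compare(sectionId, other.sectionId);
        return bySection != 0 ? bySection : Integer.compare(position, other.position);
    }
}
{code}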

Because of the new ordering, commit log replay now has to be able to sort and operate on the level of sections (for new logs) as well as on the level of segments (for legacy logs). The machinery is refactored a little to permit this, and the new code is also used to select a non-conflicting section ID at startup.
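
As an illustrative sketch only (Section and the replay shape here are stand-ins, not the actual refactored code), replay over new-format logs amounts to a global sort of sections by ID before their mutations are applied:

{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of section-ordered replay. Sections gathered from all
// segments on all volumes are sorted by their section ID, restoring the
// global mutation order regardless of which drive each section landed on.
final class SectionReplayer
{
    static final class Section
    {
        final long id;                // section ID written with the sync marker
        final List<byte[]> mutations; // serialized mutations between two sync markers
        Section(long id, List<byte[]> mutations) { this.id = id; this.mutations = mutations; }
    }

    void replay(List<Section> sectionsFromAllSegments)
    {
        List<Section> sorted = new ArrayList<>(sectionsFromAllSegments);
        sorted.sort(Comparator.comparingLong(s -> s.id));
        for (Section section : sorted)
            for (byte[] mutation : section.mutations)
                apply(mutation);
    }

    private void apply(byte[] serializedMutation) { /* deserialize and apply */ }
}
{code}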

For full flexibility, commit log volumes are configured separately from data volumes; if necessary, multiple volumes can be assigned to the same drive. With archiving it is not clear where archived logs should be restored, so I added an option to specify that as well (defaulting to the first CL volume).
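
For illustration, the settings might surface as Config fields along these lines; the option names here are invented and need not match the patch:

{code:java}
// Hypothetical sketch of the new settings as Config fields; names are
// made up for illustration. Multiple entries may point at the same drive
// if the operator wants that.
public class Config
{
    // One or more commit log directories, independent of the data directories.
    public String[] commitlog_directories = { "/var/lib/cassandra/commitlog" };

    // Where restored archive logs are placed; null means the first CL volume.
    public String commitlog_archive_restore_directory = null;
}
{code}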

The current code has more locking than I'd like, most importantly in CLSM.advanceVolume(), which is called every time a disk synchronization is requested (also when a segment is full, but that happens much less frequently). There is a noticeable impact on performance; I need more performance testing in various configurations to quantify it. I can see three ways to continue from here:

# Leave the locking as it is, which permits flexibility in the ordering of volumes in the queue. This can be exploited by making queuedVolumes a priority queue, ordered e.g. by expected sync finish time (see the sketch after this list). Such an ordering handles heterogeneous situations very well (e.g. SSDs + HDDs, and more importantly, uneven distribution of requests from other parts of the code across the drives). I think this option will result in the least complex code and the most flexible solution.
# Disallow reordering of volumes in the queue, which lets section IDs be assigned on queue entry rather than exit; with a little more work, switching to a new section from the queue can be made a single compare-and-swap. In this option the load necessarily has to be spread evenly between the specified CL volumes (though not necessarily between the drives, as a user may still give multiple directories on the same drive). With a single CL volume, and possibly in homogeneous scenarios, this option should give the best performance.
# As above, but put sections in the queue only when the previous sync for the volume has completed. This option can use the drives' performance most efficiently, but it needs another queuing layer to properly handle situations where all drives are busy and mutations are still incoming.
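
A minimal sketch of the priority queue from option 1; Volume and the finish-time estimate are assumptions for illustration, not code from the branch:

{code:java}
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch of option 1: queuedVolumes as a priority queue
// ordered by each volume's expected sync finish time, so the volume
// likely to be ready soonest is handed out first.
final class VolumeQueue
{
    static final class Volume
    {
        volatile long expectedSyncFinishNanos; // estimate updated after each sync
    }

    private final PriorityQueue<Volume> queuedVolumes =
        new PriorityQueue<>(Comparator.comparingLong(v -> v.expectedSyncFinishNanos));

    synchronized Volume takeNext()
    {
        return queuedVolumes.poll(); // null if all volumes are busy
    }

    synchronized void requeue(Volume v, long syncDurationEstimateNanos)
    {
        // Update the estimate before re-inserting, so the heap order is correct.
        v.expectedSyncFinishNanos = System.nanoTime() + syncDurationEstimateNanos;
        queuedVolumes.add(v);
    }
}
{code}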

I'm leaning towards (1) for the flexibility, but it may cause a performance regression in the single-volume case. Is it worth investing the time to try out two or all three options?

> Add the ability to automatically distribute your commitlogs across all data volumes
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-7075
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7075
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Tupshin Harper
>            Assignee: Branimir Lambov
>            Priority: Minor
>              Labels: performance
>             Fix For: 3.0
>
>
> Given the prevalence of SSDs (no need to separate commitlog and data) and improved JBOD support, along with CASSANDRA-3578, it seems like we should have an option to have one commitlog per data volume, to even out the load. I've been seeing more and more cases where there isn't an obvious "extra" volume to put the commitlog on, and sticking it on only one of the JBOD SSD volumes leads to IO imbalance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)