Posted to commits@cassandra.apache.org by "Lerh Chuan Low (JIRA)" <ji...@apache.org> on 2018/02/06 02:08:01 UTC

[jira] [Comment Edited] (CASSANDRA-8460) Make it possible to move non-compacting sstables to slow/big storage in DTCS

    [ https://issues.apache.org/jira/browse/CASSANDRA-8460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346255#comment-16346255 ] 

Lerh Chuan Low edited comment on CASSANDRA-8460 at 2/6/18 2:07 AM:
-------------------------------------------------------------------

I've tentatively started work on this, and it's turning out to be a bigger code change than I was originally expecting, so I would really love to get some feedback from those in the community who know more (and a review of my initial patches). 

{{CompactionAwareWriter}}, {{DiskBoundaryManager}}, {{Directories}} and {{CompactionStrategyManager}} all need to know about archives. I've gone ahead and created a new enum, {{DirectoryType}}, which can be either {{ARCHIVE}} or {{STANDARD}}. 
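For concreteness, the enum I have in mind is minimal; this is just a sketch of the shape, not the final patch:

```java
// Sketch of the proposed DirectoryType enum: a directory is either part of
// normal (possibly JBOD) storage, or part of the slower archive storage.
enum DirectoryType
{
    STANDARD,
    ARCHIVE
}
```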

{{CompactionAwareWriter}} always calls {{maybeSwitchWriter(DecoratedKey)}} before calling {{realAppend}}. This handles the JBOD case: {{maybeSwitchWriter}} makes the writer write to the right location depending on the key, so that keys do not overlap across directories. It therefore needs to know which {{diskBoundaries}} it is actually using, so as not to end up unable to differentiate between an actual archive disk and an actual JBOD disk. 
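The boundary lookup that {{maybeSwitchWriter}} relies on can be pictured with a small stand-alone sketch (plain longs stand in for tokens/{{DecoratedKey}}s here; this illustrates the idea, not the actual Cassandra code):

```java
import java.util.List;

// Stand-alone sketch of per-directory boundary lookup: each directory owns a
// contiguous token range, and the writer switches whenever the index changes.
class BoundarySketch
{
    // boundaries.get(i) is the highest token that still belongs to directory i
    static int directoryIndexFor(long keyToken, List<Long> boundaries)
    {
        for (int i = 0; i < boundaries.size(); i++)
        {
            if (keyToken <= boundaries.get(i))
                return i; // maybeSwitchWriter would switch writers when this changes
        }
        // past the last boundary: stay on the last directory
        return boundaries.size() - 1;
    }
}
```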

It would be wise to re-use the logic in {{diskBoundaries}} to also handle the case when the archive directory has been configured as JBOD, so {{DiskBoundaryManager}} now also needs to know about archive directories. When it tries to {{getWriteableLocations}} or generate disk boundaries, it should be able to differentiate between archive and non-archive. 
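What I mean by type-aware {{getWriteableLocations}} is roughly the following sketch ({{DataDirectory}} here is a stand-in record, and the signature is my assumption about the patch, not the real API):

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of filtering writeable locations by directory type, so archive and
// standard (JBOD) directories never get mixed into one another's boundaries.
class LocationsSketch
{
    enum DirectoryType { STANDARD, ARCHIVE }

    record DataDirectory(String path, DirectoryType type) {}

    static List<DataDirectory> getWriteableLocations(List<DataDirectory> all, DirectoryType wanted)
    {
        // the same filtering would apply when generating disk boundaries per type
        return all.stream()
                  .filter(d -> d.type() == wanted)
                  .collect(Collectors.toList());
    }
}
```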

The same goes for {{CompactionStrategyManager}}. We still need to be able to run separate compaction strategy instances in the archive directory to handle repairs and streaming (so archived SSTables don't just accumulate indefinitely). This is where I am not sure which way to proceed. 

Option 1: 
Have it so that {{ColumnFamilyStore}} still maintains only one CSM, one DBM and one {{Directories}}. The CSM, DBM and {{Directories}} all know about the existence of an archive directory; this can either be an extra field, or an EnumMap:

{code}
new EnumMap<Directories.DirectoryType, DiskBoundaries>(Directories.DirectoryType.class)
{{
    put(Directories.DirectoryType.STANDARD, cfs.getDiskBoundaries(Directories.DirectoryType.STANDARD));
    put(Directories.DirectoryType.ARCHIVE, cfs.getDiskBoundaries(Directories.DirectoryType.ARCHIVE));
}}
{code}

My worry here is that some things may subtly break even as I fix up everything else that gets logged as errors. The CSM's own internal fields {{repaired}}, {{unrepaired}} and {{pendingRepaired}} will also need to become maps, otherwise the individual instances will again become confused, unable to differentiate between an actual JBOD disk and an archive disk. Some of the APIs, e.g. {{reload}}, {{shutdown}}, {{enable}} etc., will need some smarts about which directory type is meant (in some cases it won't matter), and every consumer of these APIs will also need to be updated. 
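To illustrate what "fields become maps" might look like, here's a rough stand-alone sketch (the names and the {{String}} stand-ins for strategy instances are assumptions, not the actual patch):

```java
import java.util.ArrayList;
import java.util.EnumMap;
import java.util.List;

// Rough shape of the CSM's per-disk strategy lists under Option 1: one list of
// strategy instances per directory type. String stands in for the real
// AbstractCompactionStrategy instances.
class CsmFieldsSketch
{
    enum DirectoryType { STANDARD, ARCHIVE }

    final EnumMap<DirectoryType, List<String>> repaired = new EnumMap<>(DirectoryType.class);
    final EnumMap<DirectoryType, List<String>> unrepaired = new EnumMap<>(DirectoryType.class);

    CsmFieldsSketch()
    {
        for (DirectoryType type : DirectoryType.values())
        {
            repaired.put(type, new ArrayList<>());
            unrepaired.put(type, new ArrayList<>());
        }
    }

    // APIs like reload/shutdown/enable now need to know which type they act on
    void shutdown(DirectoryType type)
    {
        repaired.get(type).clear();
        unrepaired.get(type).clear();
    }
}
```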

Here's how it looks in an initial attempt: 
https://github.com/apache/cassandra/compare/trunk...juiceblender:cassandra-8460-single-csm?expand=1

Option 2:
Have it so that {{ColumnFamilyStore}} keeps 2 CSMs and 2 DBMs, where the archiving equivalents are {{null}} if not applicable/not reloaded. In this case there's a reasonable level of confidence that each CSM and DBM will just 'do the right thing', regardless of whether it is managing the archive or not. The trade-off is that every call site that fetches the DBM or CSM (and there are a lot that fetch the CSM) will need to be evaluated and checked. 
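The shape of Option 2 is roughly the following (a hypothetical sketch with {{Object}} standing in for the real CSM class; the accessor name is my assumption):

```java
// Sketch of Option 2: the archive CSM is simply null when no archive directory
// is configured, so every call site must choose which manager it wants.
class CfsSketch
{
    private final Object standardCsm = new Object();
    private final Object archiveCsm; // null when archiving is not configured

    CfsSketch(boolean archiveConfigured)
    {
        this.archiveCsm = archiveConfigured ? new Object() : null;
    }

    Object getCompactionStrategyManager(boolean archive)
    {
        if (archive && archiveCsm == null)
            throw new IllegalStateException("no archive directory configured for this table");
        return archive ? archiveCsm : standardCsm;
    }
}
```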

Here's how it looks in an initial attempt: 
https://github.com/apache/cassandra/compare/trunk...juiceblender:cassandra-8460?expand=1

Both still need work (Scrubber, relocating SSTables, what happens when archiving is turned off, etc.), but before I continue down this track, I'm wondering if anyone can point out which way is better, or whether this is all misguided. And if these are indeed the changes that need to happen (I can't seem to find a way for just TWCS to be aware that there's an archive directory; CFS needs to know as well), is this still worth the complexity introduced? 

[~pavel.trukhanov] Re "Why can't we simply allow a CS instance to spread across two disks - SSD and corresponding archival HDD" -> I think in that case you're back in the situation where data can be resurrected. Other replicas can compact away tombstones (because the CS can see both directories), while the last remaining replica, before it manages to do the same, has the SSD holding its tombstone become corrupted. Upon replacing the SSD with a new one and issuing a repair, the deleted data is resurrected. Of course, this can be mitigated by making it clear to operators that every time a disk is corrupted, every single disk needs to be replaced. 

Even if we did so, there would still be large code changes to make the CSM and DBM able to differentiate whether the other directory they are managing is really a JBOD disk or an archive disk. 



> Make it possible to move non-compacting sstables to slow/big storage in DTCS
> ----------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8460
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8460
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Marcus Eriksson
>            Assignee: Lerh Chuan Low
>            Priority: Major
>              Labels: doc-impacting, dtcs
>             Fix For: 4.x
>
>
> It would be nice if we could configure DTCS to have a set of extra data directories where we move the sstables once they are older than max_sstable_age_days. 
> This would enable users to have a quick, small SSD for hot, new data, and big spinning disks for data that is rarely read and never compacted.


