You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@cassandra.apache.org by Michael Theroux <mt...@yahoo.com> on 2012/12/15 00:58:14 UTC

Read operations resulting in a write?

Hello,

We have an unusual situation that I believe I've reproduced, at least temporarily, in a test environment.  I also think I see where this issue is occurring in the code.

We have a specific column family that is under heavy read and write load on a nightly basis.   For the purposes of this description, I'll refer to this column family as "Bob".  During this nightly processing, sometimes Bob is under very write load, other times it is very heavy read load.

The application is such that when something is written to Bob, a write is made to one of two other tables.  We've witnessed a situation where the write count on Bob far outstrips the write count on either of the other tables, by a factor of 3->10.  This is based on the WriteCount available on the column family JMX MBean.  We have not been able to find where in our code this is happening, and we have gone as far as tracing our CQL calls to determine that the relationship between Bob and the other tables are what we expect.

I brought up a test node to experiment, and see a situation where, when a "select" statement is executed, a write will occur.

In my test, I perform the following (switching between nodetool and cqlsh):

update bob set 'about'='coworker' where key='<hex key>';    
nodetool flush
update bob set 'about'='coworker' where key='<hex key>';    
nodetool flush
update bob set 'about'='coworker' where key='<hex key>';    
nodetool flush
update bob set 'about'='coworker' where key='<hex key>';    
nodetool flush
update bob set 'about'='coworker' where key='<hex key>';    
nodetool flush

Then, for a period of time (before a minor compaction occurs), a select statement that selects specific columns will cause writes to occur in the write count of the column family:

select about,changed,data from bob where key='<hex key>';

This situation will continue until a minor compaction is completed.

I went into the code and added some traces to CollationController.java:

private ColumnFamily collectTimeOrderedData() { logger.debug("collectTimeOrderedData"); ... <snip> ... 

---> HERE   logger.debug( "tables iterated: " + sstablesIterated +  " Min compact: " + cfs.getMinimumCompactionThreshold() );
// "hoist up" the requested data into a more recent sstable if (sstablesIterated > cfs.getMinimumCompactionThreshold() && !cfs.isCompactionDisabled() && cfs.getCompactionStrategy() instanceof SizeTieredCompactionStrategy) { RowMutation rm = new RowMutation(cfs.table.name, new Row(filter.key, returnCF.cloneMe())); try {
---> HERE  logger.debug( "Apply hoisted up row mutation" );
// skipping commitlog and index updates is fine since we're just de-fragmenting existing data Table.open(rm.getTable()).apply(rm, false, false); } catch (IOException e) { // log and allow the result to be returned logger.error("Error re-writing read results", e); } } 
... <snip> ...

Performing the steps above, I see the following traces (in the test environment I decreased the minimum compaction threshold to make this easier to reproduce).  After I do a couple of update/flush, I see this in the log:

DEBUG [FlushWriter:7] 2012-12-14 22:54:40,106 CompactionManager.java (line 117) Scheduling a background task check for bob with SizeTieredCompactionStrategy


Then, until compaction occurs, I see (when performing a select):

DEBUG [ScheduledTasks:1] 2012-12-14 22:55:15,998 LoadBroadcaster.java (line 86) Disseminating load info ...
DEBUG [Thrift:12] 2012-12-14 22:55:16,990 CassandraServer.java (line 1227) execute_cql_query
DEBUG [Thrift:12] 2012-12-14 22:55:16,991 QueryProcessor.java (line 445) CQL statement type: SELECT
DEBUG [Thrift:12] 2012-12-14 22:55:16,991 StorageProxy.java (line 653) Command/ConsistencyLevel is SliceByNamesReadCommand(table='open', key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', columnName='null')', columns=[about,changed,data,])/ONE
DEBUG [Thrift:12] 2012-12-14 22:55:16,992 ReadCallback.java (line 79) Blockfor is 1; setting up requests to /10.0.4.20
DEBUG [Thrift:12] 2012-12-14 22:55:16,992 StorageProxy.java (line 669) reading data locally
DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 StorageProxy.java (line 813) LocalReadRunnable reading SliceByNamesReadCommand(table='open', key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', columnName='null')', columns=[about,changed,data,])
DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 68) In get top level columns: class org.apache.cassandra.db.filter.NamesQueryFilter type: Standard valid: class org.apache.cassandra.db.marshal.BytesType
DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 84) collectTimeOrderedData
---> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java (line 188) tables iterated: 4 Min compact: 2

----> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java (line 198) Apply hoisted up row mutation
DEBUG [ReadStage:61] 2012-12-14 22:55:17,193 Table.java (line 395) applying mutation of row 804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342

The above traces will occur every time I repeat the above select statement.

Minor compaction doesn't start until a few minutes after the request was submitted above (note, this is an unloaded test node):

DEBUG [CompactionExecutor:11] 2012-12-14 22:57:03,278 IntervalNode.java (line 45) Creating IntervalNode from [Interval(DecoratedKey(Token(bytes[804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce...


Once minor compaction occurs, the behavior around write count being incremented stops, until more than the minimum compaction threshold memtables are flush to disk.

So, my questions are:

1) Am I reading things correctly?

2) What is really happening here?  Essentially minor compactions can occur between 4 and 32 memtable flushes.  Looking through the code, this seems to only effect a couple types of select statements (when selecting a specific column on a specific key being one of them). During the time between these two values, every "select" statement will perform a write.

3) Is this desired behavior?  Is there something else I should be looking at that could be causing this behavior?

We are running Cassandra 1.1.2, with SizeTieredCompactionStrategy.  
Any help is appreciated,
Thanks,
-Mike

Re: Read operations resulting in a write?

Posted by aaron morton <aa...@thelastpickle.com>.

> Could it be an issue that this optimization does not really take effect until the memtable with the hoisted data is flushed? 
No.
The read path in collectTimeOrderedData() reads from the memtable first. It then reads the SStable meta data (maxTimestamp) and checks if the candidate columns are both 1) all the columns in the query 2) the only possible values .

So immediately after the columns are hoisted a read will touch the and the sstable meta data (always in memory) for the most recent sstable.

> In my simple example below, the same row is updated and multiple selects of the same row will result in multiple writes to the memtable.
with some overlapping reads this would be possible, once one of them has completed subsequent operations would read from the memtable only. 
 
> It seems it maybe possible (although unlikely) where, if you go from a write-mostly to a read-mostly scenario, you could get into a state where you are stuck rewriting to the same memtable, and the memtable is not flushed because it absorbs the over-writes.
Memtable would still be flushed due to other CF's generating memory pressure and/or the commit log check pointing. 
Also reads go to the memtable first. 

> However, are there any hidden gotcha's there because this optimization is not occurring?  
Not that I can think off, the optimisation is not occurring because all the work getTimeOrderedData() does to read from disk has been done.
You're good to to. Assuming you have narrow rows, or a good feel for how big they will get. 

> https://issues.apache.org/jira/browse/CASSANDRA-2503 mentions a "compaction is behind" problem.  Any history on that?        I couldn't find too much information on it.
I assume it means cases where minor compaction cannot keep up. E.g. it has been throttled down, or a concurrent repair / upgradesstables is slowing things. 

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 18/12/2012, at 4:01 AM, Mike <mt...@yahoo.com> wrote:

> Thank you Aaron, this was very helpful.
> 
> Could it be an issue that this optimization does not really take effect until the memtable with the hoisted data is flushed?  In my simple example below, the same row is updated and multiple selects of the same row will result in multiple writes to the memtable.  It seems it maybe possible (although unlikely) where, if you go from a write-mostly to a read-mostly scenario, you could get into a state where you are stuck rewriting to the same memtable, and the memtable is not flushed because it absorbs the over-writes.  I can foresee this especially if you are reading the same rows repeatedly.
> 
> I also noticed from the codepaths that if Row caching is enabled, this optimization will not occur.  We made some changes this weekend to make this column family more suitable to row-caching and enabled row-caching with a small cache.  Our initial results is that it seems to have corrected the write counts, and has increased performance quite a bit.  However, are there any hidden gotcha's there because this optimization is not occurring?  https://issues.apache.org/jira/browse/CASSANDRA-2503 mentions a "compaction is behind" problem.  Any history on that?        I couldn't find too much information on it.
> 
> Thanks,
> -Mike
> 
> On 12/16/2012 8:41 PM, aaron morton wrote:
>> 
>>> 1) Am I reading things correctly?
>> Yes. 
>> If you do a read/slice by name and more than min compaction level nodes where read the data is re-written so that the next read uses fewer SSTables.
>> 
>>> 2) What is really happening here?  Essentially minor compactions can occur between 4 and 32 memtable flushes.  Looking through the code, this seems to only effect a couple types of select statements (when selecting a specific column on a specific key being one of them). During the time between these two values, every "select" statement will perform a write.
>> Yup, only for readying a row where the column names are specified.
>> Remember minor compaction when using SizedTiered Compaction (the default) works on buckets of the same size. 
>> 
>> Imagine a row that had been around for a while and had fragments in more than Min Compaction Threshold sstables. Say it is 3 SSTables in the 2nd tier and 2 sstables in the 1st. So it takes (potentially) 5 SSTable reads. If this row is read it will get hoisted back up. 
>> 
>> But the row has is in only 1 SSTable in the 2nd tier and 2 in the 1st tier it will not hoisted. 
>> 
>> There are a few short circuits in the SliceByName read path. One of them is to end the search when we know that no other         SSTables contain columns that should be considered. So if the 4 columns you read frequently are hoisted into the 1st bucket your reads will get handled by that one bucket. 
>> 
>> It's not every select. Just those that touched more the min compaction sstables. 
>> 
>> 
>>> 3) Is this desired behavior?  Is there something else I should be looking at that could be causing this behavior?
>> Yes.
>> https://issues.apache.org/jira/browse/CASSANDRA-2503
>> 
>> Cheers
>> 
>>    
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> New Zealand
>> 
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 15/12/2012, at 12:58 PM, Michael Theroux <mt...@yahoo.com> wrote:
>> 
>>> Hello,
>>> 
>>> We have an unusual situation that I believe I've reproduced, at least temporarily, in a test environment.  I also think I see where this issue is occurring in the code.
>>> 
>>> We have a specific column family that is under heavy read and write load on a nightly basis.   For the purposes of this description, I'll refer to this column family as "Bob".  During this nightly processing, sometimes Bob is under very write load, other times it is very heavy read load.
>>> 
>>> The application is such that when something is written to Bob, a write is made to one of two other tables.  We've witnessed a situation where the write count on Bob far outstrips the write count on either of the other tables, by a factor of 3->10.  This is based on the WriteCount available on the column family JMX MBean.  We have not been able to find where in our code this is happening, and we have gone as far as tracing our CQL calls to determine that the relationship between Bob and the other tables are what we expect.
>>> 
>>> I brought up a test node to experiment, and see a situation where, when a "select" statement is executed, a write will occur.
>>> 
>>> In my test, I perform the following (switching between nodetool and cqlsh):
>>> 
>>> update bob set 'about'='coworker' where key='<hex key>';    
>>> nodetool flush
>>> update bob set 'about'='coworker' where key='<hex key>';    
>>> nodetool flush
>>> update bob set 'about'='coworker' where key='<hex key>';    
>>> nodetool flush
>>> update bob set 'about'='coworker' where key='<hex key>';    
>>> nodetool flush
>>> update bob set 'about'='coworker' where key='<hex key>';    
>>> nodetool flush
>>> 
>>> Then, for a period of time (before a minor compaction occurs), a select statement that selects specific columns will cause writes to occur in the write count of the column family:
>>> 
>>> select about,changed,data from bob where key='<hex key>';
>>> 
>>> This situation will continue until a minor compaction is completed.
>>> 
>>> I went into the code and added some traces to CollationController.java:
>>> 
>>>    private ColumnFamily collectTimeOrderedData()
>>>     {
>>>         logger.debug("collectTimeOrderedData");
>>> 
>>>       ... <snip> ...
>>> 
>>> ---> HERE   logger.debug( "tables iterated: " + sstablesIterated +  " Min compact: " + cfs.getMinimumCompactionThreshold() );
>>>             // "hoist up" the requested data into a more recent sstable
>>>             if (sstablesIterated > cfs.getMinimumCompactionThreshold()
>>>                 && !cfs.isCompactionDisabled()
>>>                 && cfs.getCompactionStrategy() instanceof SizeTieredCompactionStrategy)
>>>             {
>>>                 RowMutation rm = new RowMutation(cfs.table.name, new Row(filter.key, returnCF.cloneMe()));
>>>                 try
>>>                 {
>>> ---> HERE               logger.debug( "Apply hoisted up row mutation" );	
>>>                     // skipping commitlog and index updates is fine since we're just de-fragmenting existing data
>>>                     Table.open(rm.getTable()).apply(rm, false, false);
>>>                 }
>>>                 catch (IOException e)
>>>                 {
>>>                     // log and allow the result to be returned
>>>                     logger.error("Error re-writing read results", e);
>>>                 }
>>>             } 
>>> ...
>>>                           <snip> ...
>>> 
>>> 
>>>                         
>>> Performing
>>>                           the steps above, I see the following traces
>>>                           (in the test environment I decreased the
>>>                           minimum compaction threshold to make this
>>>                           easier to reproduce). After I do a couple of
>>>                           update/flush, I see this in the log:
>>> 
>>> 
>>>                         
>>> DEBUG [FlushWriter:7] 2012-12-14 22:54:40,106 CompactionManager.java (line 117) Scheduling a background task check for bob with SizeTieredCompactionStrategy
>>> 
>>>                         
>>> 
>>> Then, until compaction occurs, I see (when performing a select):
>>> 
>>> DEBUG [ScheduledTasks:1] 2012-12-14 22:55:15,998 LoadBroadcaster.java (line 86) Disseminating load info ...
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,990 CassandraServer.java (line 1227) execute_cql_query
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 QueryProcessor.java (line 445) CQL statement type: SELECT
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 StorageProxy.java (line 653) Command/ConsistencyLevel is SliceByNamesReadCommand(table='open', key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', columnName='null')', columns=[about,changed,data,])/ONE
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 ReadCallback.java (line 79) Blockfor is 1; setting up requests to /10.0.4.20
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 StorageProxy.java (line 669) reading data locally
>>> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 StorageProxy.java (line 813) LocalReadRunnable reading SliceByNamesReadCommand(table='open', key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', columnName='null')', columns=[about,changed,data,])
>>> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 68) In get top level columns: class org.apache.cassandra.db.filter.NamesQueryFilter type: Standard valid: class org.apache.cassandra.db.marshal.BytesType
>>> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 84) collectTimeOrderedData
>>> ---> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java (line 188) tables iterated: 4 Min compact: 2
>>> ----> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java (line 198) Apply hoisted up row mutation
>>> DEBUG [ReadStage:61] 2012-12-14 22:55:17,193 Table.java (line 395) applying mutation of row 804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342
>>> 
>>> The above traces will occur every time I repeat the above select statement.
>>> 
>>> Minor compaction doesn't start until a few minutes after the request was submitted above (note, this is an unloaded test node):
>>> 
>>> DEBUG [CompactionExecutor:11] 2012-12-14 22:57:03,278 IntervalNode.java (line 45) Creating IntervalNode from [Interval(DecoratedKey(Token(bytes[804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce...
>>> 
>>> Once minor compaction occurs, the behavior around write count being incremented stops, until more than the minimum compaction threshold memtables are flush to disk.
>>> 
>>> So, my questions are:
>>> 
>>> 1) Am I reading things correctly?
>>> 
>>> 2) What is really happening here?  Essentially minor compactions can occur between 4 and 32 memtable flushes.  Looking through the code, this seems to only effect a couple types of select statements (when selecting a specific column on a specific key being one of them). During the time between these two values, every "select" statement will perform a write.
>>> 
>>> 3) Is this desired behavior?  Is there something else I should be looking at that could be causing this behavior?
>>> 
>>> We are running Cassandra 1.1.2, with SizeTieredCompactionStrategy.  
>>> Any help is appreciated,
>>> Thanks,
>>> -Mike
>>> 
>>> 
>>>  
>> 
>

Re: Read operations resulting in a write?

Posted by aaron morton <aa...@thelastpickle.com>.

AFAIK there is no way to disable hoisting. 

Feel free to let your jira fingers do the talking. 

Cheers

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 18/12/2012, at 6:10 PM, Edward Capriolo <ed...@gmail.com> wrote:

> Is there a way to turn this on and off through configuration? I am not necessarily sure I would want this feature. Also it is confusing if these writes show up in JMX and look like user generated write operations.
> 
> 
> On Mon, Dec 17, 2012 at 10:01 AM, Mike <mt...@yahoo.com> wrote:
> Thank you Aaron, this was very helpful.
> 
> Could it be an issue that this optimization does not really take effect until the memtable with the hoisted data is flushed?  In my simple example below, the same row is updated and multiple selects of the same row will result in multiple writes to the memtable.  It seems it maybe possible (although unlikely) where, if you go from a write-mostly to a read-mostly scenario, you could get into a state where you are stuck rewriting to the same memtable, and the memtable is not flushed because it absorbs the over-writes.  I can foresee this especially if you are reading the same rows repeatedly.
> 
> I also noticed from the codepaths that if Row caching is enabled, this optimization will not occur.  We made some changes this weekend to make this column family more suitable to row-caching and enabled row-caching with a small cache.  Our initial results is that it seems to have corrected the write counts, and has increased performance quite a bit.  However, are there any hidden gotcha's there because this optimization is not occurring?  https://issues.apache.org/jira/browse/CASSANDRA-2503 mentions a "compaction is behind" problem.  Any history on that?  I couldn't find too much information on it.
> 
> Thanks,
> -Mike
> 
> On 12/16/2012 8:41 PM, aaron morton wrote:
>> 
>>> 1) Am I reading things correctly?
>> Yes. 
>> If you do a read/slice by name and more than min compaction level nodes where read the data is re-written so that the next read uses fewer SSTables.
>> 
>>> 2) What is really happening here?  Essentially minor compactions can occur between 4 and 32 memtable flushes.  Looking through the code, this seems to only effect a couple types of select statements (when selecting a specific column on a specific key being one of them). During the time between these two values, every "select" statement will perform a write.
>> Yup, only for readying a row where the column names are specified.
>> Remember minor compaction when using SizedTiered Compaction (the default) works on buckets of the same size. 
>> 
>> Imagine a row that had been around for a while and had fragments in more than Min Compaction Threshold sstables. Say it is 3 SSTables in the 2nd tier and 2 sstables in the 1st. So it takes (potentially) 5 SSTable reads. If this row is read it will get hoisted back up. 
>> 
>> But the row has is in only 1 SSTable in the 2nd tier and 2 in the 1st tier it will not hoisted. 
>> 
>> There are a few short circuits in the SliceByName read path. One of them is to end the search when we know that no other SSTables contain columns that should be considered. So if the 4 columns you read frequently are hoisted into the 1st bucket your reads will get handled by that one bucket. 
>> 
>> It's not every select. Just those that touched more the min compaction sstables. 
>> 
>> 
>>> 3) Is this desired behavior?  Is there something else I should be looking at that could be causing this behavior?
>> Yes.
>> https://issues.apache.org/jira/browse/CASSANDRA-2503
>> 
>> Cheers
>> 
>>    
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> New Zealand
>> 
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 15/12/2012, at 12:58 PM, Michael Theroux <mt...@yahoo.com> wrote:
>> 
>>> Hello,
>>> 
>>> We have an unusual situation that I believe I've reproduced, at least temporarily, in a test environment.  I also think I see where this issue is occurring in the code.
>>> 
>>> We have a specific column family that is under heavy read and write load on a nightly basis.   For the purposes of this description, I'll refer to this column family as "Bob".  During this nightly processing, sometimes Bob is under very write load, other times it is very heavy read load.
>>> 
>>> The application is such that when something is written to Bob, a write is made to one of two other tables.  We've witnessed a situation where the write count on Bob far outstrips the write count on either of the other tables, by a factor of 3->10.  This is based on the WriteCount available on the column family JMX MBean.  We have not been able to find where in our code this is happening, and we have gone as far as tracing our CQL calls to determine that the relationship between Bob and the other tables are what we expect.
>>> 
>>> I brought up a test node to experiment, and see a situation where, when a "select" statement is executed, a write will occur.
>>> 
>>> In my test, I perform the following (switching between nodetool and cqlsh):
>>> 
>>> update bob set 'about'='coworker' where key='<hex key>';    
>>> nodetool flush
>>> update bob set 'about'='coworker' where key='<hex key>';    
>>> nodetool flush
>>> update bob set 'about'='coworker' where key='<hex key>';    
>>> nodetool flush
>>> update bob set 'about'='coworker' where key='<hex key>';    
>>> nodetool flush
>>> update bob set 'about'='coworker' where key='<hex key>';    
>>> nodetool flush
>>> 
>>> Then, for a period of time (before a minor compaction occurs), a select statement that selects specific columns will cause writes to occur in the write count of the column family:
>>> 
>>> select about,changed,data from bob where key='<hex key>';
>>> 
>>> This situation will continue until a minor compaction is completed.
>>> 
>>> I went into the code and added some traces to CollationController.java:
>>> 
>>>    private ColumnFamily collectTimeOrderedData()
>>>     {
>>>         logger.debug("collectTimeOrderedData");
>>> 
>>>       ... <snip> ...
>>> 
>>> ---> HERE   logger.debug( "tables iterated: " + sstablesIterated +  " Min compact: " + cfs.getMinimumCompactionThreshold() );
>>>             // "hoist up" the requested data into a more recent sstable
>>>             if (sstablesIterated > cfs.getMinimumCompactionThreshold()
>>>                 && !cfs.isCompactionDisabled()
>>>                 && cfs.getCompactionStrategy() instanceof SizeTieredCompactionStrategy)
>>>             {
>>>                 RowMutation rm = new RowMutation(cfs.table.name, new Row(filter.key, returnCF.cloneMe()));
>>>                 try
>>>                 {
>>> ---> HERE               logger.debug( "Apply hoisted up row mutation" );	
>>>                     // skipping commitlog and index updates is fine since we're just de-fragmenting existing data
>>>                     Table.open(rm.getTable()).apply(rm, false, false);
>>>                 }
>>>                 catch (IOException e)
>>>                 {
>>>                     // log and allow the result to be returned
>>>                     logger.error("Error re-writing read results", e);
>>>                 }
>>>             } 
>>> ...
>>>                           <snip> ...
>>> 
>>> 
>>>                         
>>> Performing
>>>                           the steps above, I see the following traces
>>>                           (in the test environment I decreased the
>>>                           minimum compaction threshold to make this
>>>                           easier to reproduce). After I do a couple of
>>>                           update/flush, I see this in the log:
>>> 
>>> 
>>>                         
>>> DEBUG [FlushWriter:7] 2012-12-14 22:54:40,106 CompactionManager.java (line 117) Scheduling a background task check for bob with SizeTieredCompactionStrategy
>>> 
>>>                         
>>> 
>>> Then, until compaction occurs, I see (when performing a select):
>>> 
>>> DEBUG [ScheduledTasks:1] 2012-12-14 22:55:15,998 LoadBroadcaster.java (line 86) Disseminating load info ...
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,990 CassandraServer.java (line 1227) execute_cql_query
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 QueryProcessor.java (line 445) CQL statement type: SELECT
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 StorageProxy.java (line 653) Command/ConsistencyLevel is SliceByNamesReadCommand(table='open', key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', columnName='null')', columns=[about,changed,data,])/ONE
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 ReadCallback.java (line 79) Blockfor is 1; setting up requests to /10.0.4.20
>>> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 StorageProxy.java (line 669) reading data locally
>>> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 StorageProxy.java (line 813) LocalReadRunnable reading SliceByNamesReadCommand(table='open', key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', columnName='null')', columns=[about,changed,data,])
>>> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 68) In get top level columns: class org.apache.cassandra.db.filter.NamesQueryFilter type: Standard valid: class org.apache.cassandra.db.marshal.BytesType
>>> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 84) collectTimeOrderedData
>>> ---> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java (line 188) tables iterated: 4 Min compact: 2
>>> ----> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java (line 198) Apply hoisted up row mutation
>>> DEBUG [ReadStage:61] 2012-12-14 22:55:17,193 Table.java (line 395) applying mutation of row 804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342
>>> 
>>> The above traces will occur every time I repeat the above select statement.
>>> 
>>> Minor compaction doesn't start until a few minutes after the request was submitted above (note, this is an unloaded test node):
>>> 
>>> DEBUG [CompactionExecutor:11] 2012-12-14 22:57:03,278 IntervalNode.java (line 45) Creating IntervalNode from [Interval(DecoratedKey(Token(bytes[804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce...
>>> 
>>> Once minor compaction occurs, the behavior around write count being incremented stops, until more than the minimum compaction threshold memtables are flush to disk.
>>> 
>>> So, my questions are:
>>> 
>>> 1) Am I reading things correctly?
>>> 
>>> 2) What is really happening here?  Essentially minor compactions can occur between 4 and 32 memtable flushes.  Looking through the code, this seems to only effect a couple types of select statements (when selecting a specific column on a specific key being one of them). During the time between these two values, every "select" statement                       will perform a write.
>>> 
>>> 3) Is this desired behavior?  Is there something else I should be looking at that could be causing this behavior?
>>> 
>>> We are running Cassandra 1.1.2, with SizeTieredCompactionStrategy.  
>>> Any help is appreciated,
>>> Thanks,
>>> -Mike
>>> 
>>> 
>>>  
>> 
> 
>

Re: Read operations resulting in a write?

Posted by Edward Capriolo <ed...@gmail.com>.

Is there a way to turn this on and off through configuration? I am not
necessarily sure I would want this feature. Also it is confusing if these
writes show up in JMX and look like user generated write operations.


On Mon, Dec 17, 2012 at 10:01 AM, Mike <mt...@yahoo.com> wrote:

>  Thank you Aaron, this was very helpful.
>
> Could it be an issue that this optimization does not really take effect
> until the memtable with the hoisted data is flushed?  In my simple example
> below, the same row is updated and multiple selects of the same row will
> result in multiple writes to the memtable.  It seems it maybe possible
> (although unlikely) where, if you go from a write-mostly to a read-mostly
> scenario, you could get into a state where you are stuck rewriting to the
> same memtable, and the memtable is not flushed because it absorbs the
> over-writes.  I can foresee this especially if you are reading the same
> rows repeatedly.
>
> I also noticed from the codepaths that if Row caching is enabled, this
> optimization will not occur.  We made some changes this weekend to make
> this column family more suitable to row-caching and enabled row-caching
> with a small cache.  Our initial results is that it seems to have corrected
> the write counts, and has increased performance quite a bit.  However, are
> there any hidden gotcha's there because this optimization is not
> occurring?  https://issues.apache.org/jira/browse/CASSANDRA-2503 mentions
> a "compaction is behind" problem.  Any history on that?  I couldn't find
> too much information on it.
>
> Thanks,
> -Mike
>
> On 12/16/2012 8:41 PM, aaron morton wrote:
>
>
>   1) Am I reading things correctly?
>
> Yes.
> If you do a read/slice by name and more than min compaction level nodes
> where read the data is re-written so that the next read uses fewer SSTables.
>
>  2) What is really happening here?  Essentially minor compactions can
> occur between 4 and 32 memtable flushes.  Looking through the code, this
> seems to only effect a couple types of select statements (when selecting a
> specific column on a specific key being one of them). During the time
> between these two values, every "select" statement will perform a write.
>
> Yup, only for readying a row where the column names are specified.
> Remember minor compaction when using SizedTiered Compaction (the default)
> works on buckets of the same size.
>
>  Imagine a row that had been around for a while and had fragments in more
> than Min Compaction Threshold sstables. Say it is 3 SSTables in the 2nd
> tier and 2 sstables in the 1st. So it takes (potentially) 5 SSTable reads.
> If this row is read it will get hoisted back up.
>
>  But the row has is in only 1 SSTable in the 2nd tier and 2 in the 1st
> tier it will not hoisted.
>
>  There are a few short circuits in the SliceByName read path. One of them
> is to end the search when we know that no other SSTables contain columns
> that should be considered. So if the 4 columns you read frequently are
> hoisted into the 1st bucket your reads will get handled by that one bucket.
>
>  It's not every select. Just those that touched more the min compaction
> sstables.
>
>
>  3) Is this desired behavior?  Is there something else I should be
> looking at that could be causing this behavior?
>
> Yes.
> https://issues.apache.org/jira/browse/CASSANDRA-2503
>
>  Cheers
>
>
>    -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
>  @aaronmorton
> http://www.thelastpickle.com
>
>  On 15/12/2012, at 12:58 PM, Michael Theroux <mt...@yahoo.com> wrote:
>
>  Hello,
>
>  We have an unusual situation that I believe I've reproduced, at least
> temporarily, in a test environment.  I also think I see where this issue is
> occurring in the code.
>
>  We have a specific column family that is under heavy read and write load
> on a nightly basis.   For the purposes of this description, I'll refer to
> this column family as "Bob".  During this nightly processing, sometimes Bob
> is under very write load, other times it is very heavy read load.
>
>  The application is such that when something is written to Bob, a write
> is made to one of two other tables.  We've witnessed a situation where the
> write count on Bob far outstrips the write count on either of the other
> tables, by a factor of 3->10.  This is based on the WriteCount available on
> the column family JMX MBean.  We have not been able to find where in our
> code this is happening, and we have gone as far as tracing our CQL calls to
> determine that the relationship between Bob and the other tables are what
> we expect.
>
>  I brought up a test node to experiment, and see a situation where, when
> a "select" statement is executed, a write will occur.
>
>  In my test, I perform the following (switching between nodetool and
> cqlsh):
>
>  update bob set 'about'='coworker' where key='<hex key>';
> nodetool flush
>  update bob set 'about'='coworker' where key='<hex key>';
> nodetool flush
>  update bob set 'about'='coworker' where key='<hex key>';
> nodetool flush
>  update bob set 'about'='coworker' where key='<hex key>';
> nodetool flush
>  update bob set 'about'='coworker' where key='<hex key>';
> nodetool flush
>
>  Then, for a period of time (before a minor compaction occurs), a select
> statement that selects specific columns will cause writes to occur in the
> write count of the column family:
>
>  select about,changed,data from bob where key='<hex key>';
>
>  This situation will continue until a minor compaction is completed.
>
>  I went into the code and added some traces to CollationController.java:
>
>    private ColumnFamily collectTimeOrderedData()
>     {
>         logger.debug("collectTimeOrderedData");
>
>       ... <snip> ...
>
>
> ---> HERE   logger.debug( "tables iterated: " + sstablesIterated +  " Min compact: " + cfs.getMinimumCompactionThreshold() );
>
>             // "hoist up" the requested data into a more recent sstable
>             if (sstablesIterated > cfs.getMinimumCompactionThreshold()
>                 && !cfs.isCompactionDisabled()
>                 && cfs.getCompactionStrategy() instanceof SizeTieredCompactionStrategy)
>             {
>                 RowMutation rm = new RowMutation(cfs.table.name, new Row(filter.key, returnCF.cloneMe()));
>                 try
>                 {
>
> ---> HERE               logger.debug( "Apply hoisted up row mutation" );	
>
>                     // skipping commitlog and index updates is fine since we're just de-fragmenting existing data
>                     Table.open(rm.getTable()).apply(rm, false, false);
>                 }
>                 catch (IOException e)
>                 {
>                     // log and allow the result to be returned
>                     logger.error("Error re-writing read results", e);
>                 }
>             }
>
>  ... <snip> ...
>
>  Performing the steps above, I see the following traces (in the test
> environment I decreased the minimum compaction threshold to make this
> easier to reproduce). After I do a couple of update/flush, I see this in
> the log:
>
>  DEBUG [FlushWriter:7] 2012-12-14 22:54:40,106 CompactionManager.java
> (line 117) Scheduling a background task check for bob with
> SizeTieredCompactionStrategy
>
>  Then, until compaction occurs, I see (when performing a select):
>
>  DEBUG [ScheduledTasks:1] 2012-12-14 22:55:15,998 LoadBroadcaster.java
> (line 86) Disseminating load info ...
> DEBUG [Thrift:12] 2012-12-14 22:55:16,990 CassandraServer.java (line 1227)
> execute_cql_query
> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 QueryProcessor.java (line 445)
> CQL statement type: SELECT
> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 StorageProxy.java (line 653)
> Command/ConsistencyLevel is SliceByNamesReadCommand(table='open',
> key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342,
> columnParent='QueryPath(columnFamilyName='bob', superColumnName='null',
> columnName='null')', columns=[about,changed,data,])/ONE
> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 ReadCallback.java (line 79)
> Blockfor is 1; setting up requests to /10.0.4.20
> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 StorageProxy.java (line 669)
> reading data locally
> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 StorageProxy.java (line 813)
> LocalReadRunnable reading SliceByNamesReadCommand(table='open',
> key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342,
> columnParent='QueryPath(columnFamilyName='bob', superColumnName='null',
> columnName='null')', columns=[about,changed,data,])
> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java
> (line 68) In get top level columns: class
> org.apache.cassandra.db.filter.NamesQueryFilter type: Standard valid: class
> org.apache.cassandra.db.marshal.BytesType
> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java
> (line 84) collectTimeOrderedData
> ---> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java
> (line 188) tables iterated: 4 Min compact: 2
>  ----> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192
> CollationController.java (line 198) Apply hoisted up row mutation
> DEBUG [ReadStage:61] 2012-12-14 22:55:17,193 Table.java (line 395)
> applying mutation of row
> 804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342
>
>  The above traces will occur every time I repeat the above select
> statement.
>
>  Minor compaction doesn't start until a few minutes after the request was
> submitted above (note, this is an unloaded test node):
>
>  DEBUG [CompactionExecutor:11] 2012-12-14 22:57:03,278 IntervalNode.java
> (line 45) Creating IntervalNode from
> [Interval(DecoratedKey(Token(bytes[804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce...
>
>  Once minor compaction occurs, the behavior around write count being
> incremented stops, until more than the minimum compaction threshold
> memtables are flush to disk.
>
>  So, my questions are:
>
>  1) Am I reading things correctly?
>
>  2) What is really happening here?  Essentially minor compactions can
> occur between 4 and 32 memtable flushes.  Looking through the code, this
> seems to only effect a couple types of select statements (when selecting a
> specific column on a specific key being one of them). During the time
> between these two values, every "select" statement will perform a write.
>
>  3) Is this desired behavior?  Is there something else I should be
> looking at that could be causing this behavior?
>
>  We are running Cassandra 1.1.2, with SizeTieredCompactionStrategy.
> Any help is appreciated,
> Thanks,
> -Mike
>
>
>
>
>
>
>

Re: Read operations resulting in a write?

Posted by Mike <mt...@yahoo.com>.

Thank you Aaron, this was very helpful.

Could it be an issue that this optimization does not really take effect 
until the memtable with the hoisted data is flushed?  In my simple 
example below, the same row is updated and multiple selects of the same 
row will result in multiple writes to the memtable. It seems it maybe 
possible (although unlikely) where, if you go from a write-mostly to a 
read-mostly scenario, you could get into a state where you are stuck 
rewriting to the same memtable, and the memtable is not flushed because 
it absorbs the over-writes.  I can foresee this especially if you are 
reading the same rows repeatedly.

I also noticed from the codepaths that if Row caching is enabled, this 
optimization will not occur.  We made some changes this weekend to make 
this column family more suitable to row-caching and enabled row-caching 
with a small cache.  Our initial results is that it seems to have 
corrected the write counts, and has increased performance quite a bit.  
However, are there any hidden gotcha's there because this optimization 
is not occurring? https://issues.apache.org/jira/browse/CASSANDRA-2503 
mentions a "compaction is behind" problem.  Any history on that? I 
couldn't find too much information on it.

Thanks,
-Mike

On 12/16/2012 8:41 PM, aaron morton wrote:
>
>> 1) Am I reading things correctly?
> Yes.
> If you do a read/slice by name and more than min compaction level 
> nodes where read the data is re-written so that the next read uses 
> fewer SSTables.
>
>> 2) What is really happening here?  Essentially minor compactions can 
>> occur between 4 and 32 memtable flushes.  Looking through the code, 
>> this seems to only effect a couple types of select statements (when 
>> selecting a specific column on a specific key being one of them). 
>> During the time between these two values, every "select" statement 
>> will perform a write.
> Yup, only for readying a row where the column names are specified.
> Remember minor compaction when using SizedTiered Compaction (the 
> default) works on buckets of the same size.
>
> Imagine a row that had been around for a while and had fragments in 
> more than Min Compaction Threshold sstables. Say it is 3 SSTables in 
> the 2nd tier and 2 sstables in the 1st. So it takes (potentially) 5 
> SSTable reads. If this row is read it will get hoisted back up.
>
> But the row has is in only 1 SSTable in the 2nd tier and 2 in the 1st 
> tier it will not hoisted.
>
> There are a few short circuits in the SliceByName read path. One of 
> them is to end the search when we know that no other SSTables contain 
> columns that should be considered. So if the 4 columns you read 
> frequently are hoisted into the 1st bucket your reads will get handled 
> by that one bucket.
>
> It's not every select. Just those that touched more the min compaction 
> sstables.
>
>
>> 3) Is this desired behavior?  Is there something else I should be 
>> looking at that could be causing this behavior?
> Yes.
> https://issues.apache.org/jira/browse/CASSANDRA-2503
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 15/12/2012, at 12:58 PM, Michael Theroux <mtheroux2@yahoo.com 
> <ma...@yahoo.com>> wrote:
>
>> Hello,
>>
>> We have an unusual situation that I believe I've reproduced, at least 
>> temporarily, in a test environment.  I also think I see where this 
>> issue is occurring in the code.
>>
>> We have a specific column family that is under heavy read and write 
>> load on a nightly basis.   For the purposes of this description, I'll 
>> refer to this column family as "Bob".  During this nightly 
>> processing, sometimes Bob is under very write load, other times it is 
>> very heavy read load.
>>
>> The application is such that when something is written to Bob, a 
>> write is made to one of two other tables.  We've witnessed a 
>> situation where the write count on Bob far outstrips the write count 
>> on either of the other tables, by a factor of 3->10.  This is based 
>> on the WriteCount available on the column family JMX MBean.  We have 
>> not been able to find where in our code this is happening, and we 
>> have gone as far as tracing our CQL calls to determine that the 
>> relationship between Bob and the other tables are what we expect.
>>
>> I brought up a test node to experiment, and see a situation where, 
>> when a "select" statement is executed, a write will occur.
>>
>> In my test, I perform the following (switching between nodetool and 
>> cqlsh):
>>
>> update bob set 'about'='coworker' where key='<hex key>';
>> nodetool flush
>> update bob set 'about'='coworker' where key='<hex key>';
>> nodetool flush
>> update bob set 'about'='coworker' where key='<hex key>';
>> nodetool flush
>> update bob set 'about'='coworker' where key='<hex key>';
>> nodetool flush
>> update bob set 'about'='coworker' where key='<hex key>';
>> nodetool flush
>>
>> Then, for a period of time (before a minor compaction occurs), a 
>> select statement that selects specific columns will cause writes to 
>> occur in the write count of the column family:
>>
>> select about,changed,data from bob where key='<hex key>';
>>
>> This situation will continue until a minor compaction is completed.
>>
>> I went into the code and added some traces to CollationController.java:
>>
>>    private ColumnFamily collectTimeOrderedData()
>>      {
>>          logger.debug("collectTimeOrderedData");
>>
>>        ... <snip> ...
>>
>> ---> HERE   logger.debug( "tables iterated: " + sstablesIterated +  " Min compact: " + cfs.getMinimumCompactionThreshold() );
>>              // "hoist up" the requested data into a more recent sstable
>>              if (sstablesIterated > cfs.getMinimumCompactionThreshold()
>>                  && !cfs.isCompactionDisabled()
>>                  && cfs.getCompactionStrategy() instanceof SizeTieredCompactionStrategy)
>>              {
>>                  RowMutation rm = new RowMutation(cfs.table.name, new Row(filter.key, returnCF.cloneMe()));
>>                  try
>>                  {
>> ---> HERE             logger.debug( "Apply hoisted up row mutation" );	
>>                      // skipping commitlog and index updates is fine since we're just de-fragmenting existing data
>>                      Table.open(rm.getTable()).apply(rm, false, false);
>>                  }
>>                  catch (IOException e)
>>                  {
>>                      // log and allow the result to be returned
>>                      logger.error("Error re-writing read results", e);
>>                  }
>>              }
>> ... <snip> ...
>>
>> Performing the steps above, I see the following traces (in the test 
>> environment I decreased the minimum compaction threshold to make this 
>> easier to reproduce). After I do a couple of update/flush, I see this 
>> in the log:
>>
>> DEBUG [FlushWriter:7] 2012-12-14 22:54:40,106 CompactionManager.java 
>> (line 117) Scheduling a background task check for bob with 
>> SizeTieredCompactionStrategy
>>
>> Then, until compaction occurs, I see (when performing a select):
>>
>> DEBUG [ScheduledTasks:1] 2012-12-14 22:55:15,998 LoadBroadcaster.java 
>> (line 86) Disseminating load info ...
>> DEBUG [Thrift:12] 2012-12-14 22:55:16,990 CassandraServer.java (line 
>> 1227) execute_cql_query
>> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 QueryProcessor.java (line 
>> 445) CQL statement type: SELECT
>> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 StorageProxy.java (line 
>> 653) Command/ConsistencyLevel is 
>> SliceByNamesReadCommand(table='open', 
>> key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, 
>> columnParent='QueryPath(columnFamilyName='bob', 
>> superColumnName='null', columnName='null')', 
>> columns=[about,changed,data,])/ONE
>> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 ReadCallback.java (line 79) 
>> Blockfor is 1; setting up requests to /10.0.4.20
>> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 StorageProxy.java (line 
>> 669) reading data locally
>> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 StorageProxy.java (line 
>> 813) LocalReadRunnable reading SliceByNamesReadCommand(table='open', 
>> key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, 
>> columnParent='QueryPath(columnFamilyName='bob', 
>> superColumnName='null', columnName='null')', 
>> columns=[about,changed,data,])
>> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java 
>> (line 68) In get top level columns: class 
>> org.apache.cassandra.db.filter.NamesQueryFilter type: Standard valid: 
>> class org.apache.cassandra.db.marshal.BytesType
>> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java 
>> (line 84) collectTimeOrderedData
>> ---> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 
>> CollationController.java (line 188) tables iterated: 4 Min compact: 2
>> ----> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 
>> CollationController.java (line 198) Apply hoisted up row mutation
>> DEBUG [ReadStage:61] 2012-12-14 22:55:17,193 Table.java (line 395) 
>> applying mutation of row 
>> 804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342
>>
>> The above traces will occur every time I repeat the above select 
>> statement.
>>
>> Minor compaction doesn't start until a few minutes after the request 
>> was submitted above (note, this is an unloaded test node):
>>
>> DEBUG [CompactionExecutor:11] 2012-12-14 22:57:03,278 
>> IntervalNode.java (line 45) Creating IntervalNode from 
>> [Interval(DecoratedKey(Token(bytes[804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce...
>>
>> Once minor compaction occurs, the behavior around write count being 
>> incremented stops, until more than the minimum compaction threshold 
>> memtables are flush to disk.
>>
>> So, my questions are:
>>
>> 1) Am I reading things correctly?
>>
>> 2) What is really happening here?  Essentially minor compactions can 
>> occur between 4 and 32 memtable flushes.  Looking through the code, 
>> this seems to only effect a couple types of select statements (when 
>> selecting a specific column on a specific key being one of them). 
>> During the time between these two values, every "select" statement 
>> will perform a write.
>>
>> 3) Is this desired behavior?  Is there something else I should be 
>> looking at that could be causing this behavior?
>>
>> We are running Cassandra 1.1.2, with SizeTieredCompactionStrategy.
>> Any help is appreciated,
>> Thanks,
>> -Mike
>>
>>
>>
>

Re: Read operations resulting in a write?

Posted by aaron morton <aa...@thelastpickle.com>.

> 1) Am I reading things correctly?
Yes. 
If you do a read/slice by name and more than min compaction level nodes where read the data is re-written so that the next read uses fewer SSTables.

> 2) What is really happening here?  Essentially minor compactions can occur between 4 and 32 memtable flushes.  Looking through the code, this seems to only effect a couple types of select statements (when selecting a specific column on a specific key being one of them). During the time between these two values, every "select" statement will perform a write.
Yup, only for readying a row where the column names are specified.
Remember minor compaction when using SizedTiered Compaction (the default) works on buckets of the same size. 

Imagine a row that had been around for a while and had fragments in more than Min Compaction Threshold sstables. Say it is 3 SSTables in the 2nd tier and 2 sstables in the 1st. So it takes (potentially) 5 SSTable reads. If this row is read it will get hoisted back up. 

But the row has is in only 1 SSTable in the 2nd tier and 2 in the 1st tier it will not hoisted. 

There are a few short circuits in the SliceByName read path. One of them is to end the search when we know that no other SSTables contain columns that should be considered. So if the 4 columns you read frequently are hoisted into the 1st bucket your reads will get handled by that one bucket. 

It's not every select. Just those that touched more the min compaction sstables. 


> 3) Is this desired behavior?  Is there something else I should be looking at that could be causing this behavior?
Yes.
https://issues.apache.org/jira/browse/CASSANDRA-2503

Cheers

   
-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 15/12/2012, at 12:58 PM, Michael Theroux <mt...@yahoo.com> wrote:

> Hello,
> 
> We have an unusual situation that I believe I've reproduced, at least temporarily, in a test environment.  I also think I see where this issue is occurring in the code.
> 
> We have a specific column family that is under heavy read and write load on a nightly basis.   For the purposes of this description, I'll refer to this column family as "Bob".  During this nightly processing, sometimes Bob is under very write load, other times it is very heavy read load.
> 
> The application is such that when something is written to Bob, a write is made to one of two other tables.  We've witnessed a situation where the write count on Bob far outstrips the write count on either of the other tables, by a factor of 3->10.  This is based on the WriteCount available on the column family JMX MBean.  We have not been able to find where in our code this is happening, and we have gone as far as tracing our CQL calls to determine that the relationship between Bob and the other tables are what we expect.
> 
> I brought up a test node to experiment, and see a situation where, when a "select" statement is executed, a write will occur.
> 
> In my test, I perform the following (switching between nodetool and cqlsh):
> 
> update bob set 'about'='coworker' where key='<hex key>';    
> nodetool flush
> update bob set 'about'='coworker' where key='<hex key>';    
> nodetool flush
> update bob set 'about'='coworker' where key='<hex key>';    
> nodetool flush
> update bob set 'about'='coworker' where key='<hex key>';    
> nodetool flush
> update bob set 'about'='coworker' where key='<hex key>';    
> nodetool flush
> 
> Then, for a period of time (before a minor compaction occurs), a select statement that selects specific columns will cause writes to occur in the write count of the column family:
> 
> select about,changed,data from bob where key='<hex key>';
> 
> This situation will continue until a minor compaction is completed.
> 
> I went into the code and added some traces to CollationController.java:
> 
>   private ColumnFamily collectTimeOrderedData()
>     {
>         logger.debug("collectTimeOrderedData");
> 
>       ... <snip> ...
> 
> ---> HERE   logger.debug( "tables iterated: " + sstablesIterated +  " Min compact: " + cfs.getMinimumCompactionThreshold() );
>             // "hoist up" the requested data into a more recent sstable
>             if (sstablesIterated > cfs.getMinimumCompactionThreshold()
>                 && !cfs.isCompactionDisabled()
>                 && cfs.getCompactionStrategy() instanceof SizeTieredCompactionStrategy)
>             {
>                 RowMutation rm = new RowMutation(cfs.table.name, new Row(filter.key, returnCF.cloneMe()));
>                 try
>                 {
> ---> HERE               logger.debug( "Apply hoisted up row mutation" );	
>                     // skipping commitlog and index updates is fine since we're just de-fragmenting existing data
>                     Table.open(rm.getTable()).apply(rm, false, false);
>                 }
>                 catch (IOException e)
>                 {
>                     // log and allow the result to be returned
>                     logger.error("Error re-writing read results", e);
>                 }
>             } 
> ... <snip> ...
> 
> Performing the steps above, I see the following traces (in the test environment I decreased the minimum compaction threshold to make this easier to reproduce).  After I do a couple of update/flush, I see this in the log:
> 
> DEBUG [FlushWriter:7] 2012-12-14 22:54:40,106 CompactionManager.java (line 117) Scheduling a background task check for bob with SizeTieredCompactionStrategy
> 
> Then, until compaction occurs, I see (when performing a select):
> 
> DEBUG [ScheduledTasks:1] 2012-12-14 22:55:15,998 LoadBroadcaster.java (line 86) Disseminating load info ...
> DEBUG [Thrift:12] 2012-12-14 22:55:16,990 CassandraServer.java (line 1227) execute_cql_query
> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 QueryProcessor.java (line 445) CQL statement type: SELECT
> DEBUG [Thrift:12] 2012-12-14 22:55:16,991 StorageProxy.java (line 653) Command/ConsistencyLevel is SliceByNamesReadCommand(table='open', key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', columnName='null')', columns=[about,changed,data,])/ONE
> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 ReadCallback.java (line 79) Blockfor is 1; setting up requests to /10.0.4.20
> DEBUG [Thrift:12] 2012-12-14 22:55:16,992 StorageProxy.java (line 669) reading data locally
> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 StorageProxy.java (line 813) LocalReadRunnable reading SliceByNamesReadCommand(table='open', key=804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342, columnParent='QueryPath(columnFamilyName='bob', superColumnName='null', columnName='null')', columns=[about,changed,data,])
> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 68) In get top level columns: class org.apache.cassandra.db.filter.NamesQueryFilter type: Standard valid: class org.apache.cassandra.db.marshal.BytesType
> DEBUG [ReadStage:61] 2012-12-14 22:55:16,992 CollationController.java (line 84) collectTimeOrderedData
> ---> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java (line 188) tables iterated: 4 Min compact: 2
> ----> DEBUG [ReadStage:61] 2012-12-14 22:55:17,192 CollationController.java (line 198) Apply hoisted up row mutation
> DEBUG [ReadStage:61] 2012-12-14 22:55:17,193 Table.java (line 395) applying mutation of row 804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce21b5f402a5342
> 
> The above traces will occur every time I repeat the above select statement.
> 
> Minor compaction doesn't start until a few minutes after the request was submitted above (note, this is an unloaded test node):
> 
> DEBUG [CompactionExecutor:11] 2012-12-14 22:57:03,278 IntervalNode.java (line 45) Creating IntervalNode from [Interval(DecoratedKey(Token(bytes[804229d1933669d0a25d2a38c8b26ded10069573003e6dbb1ce...
> 
> Once minor compaction occurs, the behavior around write count being incremented stops, until more than the minimum compaction threshold memtables are flush to disk.
> 
> So, my questions are:
> 
> 1) Am I reading things correctly?
> 
> 2) What is really happening here?  Essentially minor compactions can occur between 4 and 32 memtable flushes.  Looking through the code, this seems to only effect a couple types of select statements (when selecting a specific column on a specific key being one of them). During the time between these two values, every "select" statement will perform a write.
> 
> 3) Is this desired behavior?  Is there something else I should be looking at that could be causing this behavior?
> 
> We are running Cassandra 1.1.2, with SizeTieredCompactionStrategy.  
> Any help is appreciated,
> Thanks,
> -Mike
> 
> 
>