Posted to user@cassandra.apache.org by Keith Wright <kw...@nanigans.com> on 2014/07/24 20:48:27 UTC

Hot, large row

Hi all,

   We are seeing an issue where, roughly daily, one of our nodes spikes in load and churns under CMS heap pressure.  It appears that reads are backing up, and my guess is that our application is reading a large row repeatedly.  Our write structure can lend itself to wide rows very infrequently (<0.001%) and we do our best to detect and delete them, but obviously we’re missing a case.  Hoping for assistance on the following questions:

 *   How can I detect wide rows?
 *   Anyone know what debug level I can set so that I can see what reads the hot node is handling?  I’m hoping to see the “bad” row
 *   Any way to get the server to blacklist these wide rows automatically?

We’re using C* 2.0.6 with Vnodes.

Thanks

Re: Hot, large row

Posted by DuyHai Doan <do...@gmail.com>.
"If I run nodetool tpstats, I see a high number of items in the Pending
phase for ReadStage.  Other items mostly appear near empty.  In addition, I
see dropped reads" --> Have a look at system I/O & CPU stats to check for
possible bottlenecks.

This symptom is not necessarily caused by wide rows. Sometimes your disk I/O
just can't keep up with an intense read burst.




On Thu, Jul 24, 2014 at 10:00 PM, Keith Wright <kw...@nanigans.com> wrote:

> I appreciate the feedback, doing it on the client side is interesting and
> I will start looking into that.  To be clear, here are the symptoms I am
> seeing:
>
>
>    - A node will start showing high load and the CMS collection time
>    jumps to 100+ ms/sec (ParNew is also up)
>    - If I run nodetool tpstats, I see a high number of items in the
>    Pending phase for ReadStage.  Other items mostly appear near empty.  In
>    addition, I see dropped reads.
>    - If I shut down the affected node, two other nodes show increased
>    load.  I am at RF=3, so this would again suggest slow reads against a single
>    row
>
> Does it sound correct that my best course of action is to investigate a
> large row?  If it were a small row being called repeatedly, I assume OS/key
> cache would make that a VERY fast operation.
>
> Thanks
>
> From: DuyHai Doan <do...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Date: Thursday, July 24, 2014 at 3:53 PM
>
> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Subject: Re: Hot, large row
>
> Your extract of cfhistograms shows that there are no particular "wide
> rows". The widest has 61214 cells, which is big but not so huge as to be
> a real concern.
>
> Turning on trace probability only gives you some "hints" about what kind
> of queries are being run; it does not give the exact partition key or
> other statement values, especially when you are using prepared statements
> ...
>
>
> "I am considering reducing read_request_timeout_in_ms: 5000 in
> cassandra.yaml so that it reduces the impact when this occurs." --> Don't
> do that, you'll only sweep dust under the carpet. Find the real issue and
> fix it instead of changing parameter to hide it.
>
> One solution would be on client side, to activate some logging to show the
> CQL3 statements the application is issuing that may overload the server.  I
> know that's better said than done but I don't have any other idea for the
> moment
>
> -------- Shameless self-promotion ------
>
> To support this kind of live prod debugging & investigation, I added a
> new dynamic query logging feature in Achilles:
> https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging
>
> Once you hit the issue, this kind of feature may save your day...
>
>
>
>
>
>
> On Thu, Jul 24, 2014 at 9:22 PM, Keith Wright <kw...@nanigans.com>
> wrote:
>
>> I can see from cfhistograms that I do have some wide rows (see below).  I
>> set trace probability as you suggested but the output doesn’t appear to
>> tell me what row was actually read unless I missed something.  I just see
>> executing prepared statement.   Any ideas how I can find the row in
>> question?
>>
>> I am considering reducing read_request_timeout_in_ms: 5000 in
>> cassandra.yaml so that it reduces the impact when this occurs.
>>
>> Any help in identifying my issue would be GREATLY appreciated
>>
>> Cell Count per Partition
>>
>>     1 cells: 50449950
>>
>>     2 cells: 14281828
>>
>>     3 cells: 8093366
>>
>>     4 cells: 5029200
>>
>>     5 cells: 3103023
>>
>>     6 cells: 3059903
>>
>>     7 cells: 1903018
>>
>>     8 cells: 1509297
>>
>>    10 cells: 2420359
>>
>>    12 cells: 1624895
>>
>>    14 cells: 1171678
>>
>>    17 cells: 1289391
>>
>>    20 cells: 909777
>>
>>    24 cells: 852081
>>
>>    29 cells: 722925
>>
>>    35 cells: 587067
>>
>>    42 cells: 459473
>>
>>    50 cells: 358744
>>
>>    60 cells: 304146
>>
>>    72 cells: 244682
>>
>>    86 cells: 191045
>>
>>   103 cells: 155337
>>
>>   124 cells: 127061
>>
>>   149 cells: 98913
>>
>>   179 cells: 77454
>>
>>   215 cells: 59849
>>
>>   258 cells: 46117
>>
>>   310 cells: 35321
>>
>>   372 cells: 26319
>>
>>   446 cells: 19379
>>
>>   535 cells: 13783
>>
>>   642 cells: 9993
>>
>>   770 cells: 6973
>>
>>   924 cells: 4713
>>
>>  1109 cells: 3229
>>
>>  1331 cells: 2062
>>
>>  1597 cells: 1338
>>
>>  1916 cells: 773
>>
>>  2299 cells: 495
>>
>>  2759 cells: 268
>>
>>  3311 cells: 150
>>
>>  3973 cells: 100
>>
>>  4768 cells: 42
>>
>>  5722 cells: 24
>>
>>  6866 cells: 12
>>
>>  8239 cells: 9
>>
>>  9887 cells: 3
>>
>> 11864 cells: 0
>>
>> 14237 cells: 5
>>
>> 17084 cells: 1
>>
>> 20501 cells: 0
>>
>> 24601 cells: 2
>>
>> 29521 cells: 0
>>
>> 35425 cells: 0
>>
>> 42510 cells: 0
>>
>> 51012 cells: 0
>>
>> 61214 cells: 2
>>
>> From: DuyHai Doan <do...@gmail.com>
>> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>> Date: Thursday, July 24, 2014 at 3:01 PM
>> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>> Subject: Re: Hot, large row
>>
>> "How can I detect wide rows?" -->
>>
>> nodetool cfhistograms <keyspace> <suspected column family>
>>
>> Look at the "Column count" column (the last column) and identify a line in
>> this column with a very high value of "Offset". In a well-designed
>> application you should have a Gaussian distribution where 80% of your rows
>> have a similar number of columns.
>>
>> "Anyone know what debug level I can set so that I can see what reads the
>> hot node is handling?  " -->
>>
>> "nodetool settraceprobability <value>",  where value is a small number
>> (0.001) on the node where you encounter the issue. Activate the tracing for
>> a while (5 mins) and deactivate it (value = 0). Then look into
>> system_traces tables "events" & "sessions". It may help or not since the
>> tracing is done once every 1000.
>>
>> "Any way to get the server to blacklist these wide rows automatically?"
>> --> No
>>
>>
>> On Thu, Jul 24, 2014 at 8:48 PM, Keith Wright <kw...@nanigans.com>
>> wrote:
>>
>>> Hi all,
>>>
>>>    We are seeing an issue where basically daily one of our nodes spikes
>>> in load and is churning in CMS heap pressure.  It appears that reads are
>>> backing up and my guess is that our application is reading a large row
>>> repeatedly.  Our write structure can lead itself to wide rows very
>>> infrequently (<0.001%) and we do our best to detect and delete them but
>>> obviously we’re missing a case.  Hoping for assistance on the following
>>> questions:
>>>
>>>    - How can I detect wide rows?
>>>    - Anyone know what debug level I can set so that I can see what
>>>    reads the hot node is handling?  I’m hoping to see the “bad” row
>>>    - Any way to get the server to blacklist these wide rows
>>>    automatically?
>>>
>>> We’re using C* 2.0.6 with Vnodes.
>>>
>>> Thanks
>>>
>>
>>
>

Re: Hot, large row

Posted by Keith Wright <kw...@nanigans.com>.
I don’t know but my guess is it would be without tombstones.  I did more research this weekend (note that my Sunday was largely interrupted by again seeing a node go to high load/high CMS for ~3 hours) and came across this presentation:  http://www.slideshare.net/mobile/planetcassandra/8-axel-liljencrantz-23204252

I definitely suggest you give this a look; it is very informative.  The important takeaway is that they ran into the same issue as I did, due to using the same model, where I am updating the same row over time with a TTL; this causes the row to fragment across SSTables, and once it spans 4+ tables, compaction can never actually remove the tombstones.  As I see it, I have the following options and was hoping to get some advice:

1.  Modify my write structure to include time within the key.  Currently we want to read the entire row, but I can likely add a month component to the key, and it would be OK for the application to do two reads to get the most recent data (to deal with month boundaries).  This would contain the fragmentation to one month (a rough sketch of this idea is below the list).

2.  Following on from item #1, it appears from CASSANDRA-5514 that if I include time within my query, Cassandra will not bother going through older SSTables, which would reduce the impact of the row fragmentation.  The problem here is that my data size will likely still continue to grow over time, as tombstones will never be removed.

3.  Move from LCS to STCS and run full compactions periodically to clean up tombstones.
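
To make option #1 concrete, here is a minimal sketch (Java, against the DataStax driver we already use) of what the read side could look like.  The table global_user_event_skus_v3 and its month bucket column are hypothetical names, assuming a primary key of ((user_id, month), app_id, type, name, sku); the point is only that the partition key gains a month component and the application reads the current and previous month to cover boundaries:

// Minimal sketch of option #1 (hypothetical month-bucketed table), using the
// DataStax Java driver 2.x API.  Table and column names here are made up.
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Calendar;
import java.util.List;
import java.util.UUID;

public class MonthBucketedReadSketch {

    private static final SimpleDateFormat MONTH_FORMAT = new SimpleDateFormat("yyyy-MM");

    // Read the current and previous month buckets so that a month boundary
    // never hides recent data; fragmentation stays contained to one bucket.
    public static List<Row> recentSkus(Session session, UUID userId) {
        // In real code this would be prepared once and reused.
        PreparedStatement ps = session.prepare(
            "SELECT * FROM users.global_user_event_skus_v3 " +
            "WHERE user_id = ? AND month = ? LIMIT 5000");

        Calendar cal = Calendar.getInstance();
        String thisMonth = MONTH_FORMAT.format(cal.getTime());
        cal.add(Calendar.MONTH, -1);
        String lastMonth = MONTH_FORMAT.format(cal.getTime());

        List<Row> rows = new ArrayList<Row>();
        for (String month : new String[] { thisMonth, lastMonth }) {
            for (Row row : session.execute(ps.bind(userId, month))) {
                rows.add(row);
            }
        }
        return rows;
    }
}

The writes would bind the same month string, so any one partition only ever accumulates a month of updates and their tombstones.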

I appreciate the help!

From: Jack Krupansky <ja...@basetechnology.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Friday, July 25, 2014 at 11:15 AM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Hot, large row

Is it the accumulated tombstones on a row that make it act as if it were “wide”? Does cfhistograms count the tombstones or subtract them when reporting the cell count for rows? (I don’t know.)

-- Jack Krupansky

From: Keith Wright <kw...@nanigans.com>
Sent: Friday, July 25, 2014 10:24 AM
To: user@cassandra.apache.org
Cc: Don Jackson <dj...@nanigans.com>
Subject: Re: Hot, large row

Ha, check out who filed that ticket!   Yes, I’m aware of it.  My hope is that it was mostly addressed in CASSANDRA-6563, so I may upgrade from 2.0.6 to 2.0.9.  I’m really just surprised that others are not doing similar things to what I am and thus experiencing similar issues.

To answer DuyHai’s questions:

How many nodes do you have? And roughly how many distinct user_ids are there?
- 14 nodes with approximately 250 million distinct user_ids

For GC activity, in general we see low GC pressure in both ParNew and CMS (we see the occasional CMS spike but it’s usually under 100 ms).  When we see a node locked up in CMS GC, it’s not that any one GC takes a long time; it’s just that their constant succession causes the read latency to spike from the usual 3-5 ms up to 35 ms, which causes issues for our application.

Also, Jack Krupansky’s question is interesting. Even though you limit a request to 5000, if each cell is a big blob or block of text, it may add up to a lot in the JVM heap …
- The column values are actually timestamps and thus not variable in length, and we cap the length of the other columns used in the primary key, so I find it VERY unlikely that this is the cause.

I will look into the paging option with that native client, but from the docs it appears that it’s enabled by default, right?

I greatly appreciate all the help!

From: Ken Hancock <ke...@schange.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Friday, July 25, 2014 at 10:06 AM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Cc: Don Jackson <dj...@nanigans.com>
Subject: Re: Hot, large row

https://issues.apache.org/jira/browse/CASSANDRA-6654

Re: Hot, large row

Posted by Jack Krupansky <ja...@basetechnology.com>.
Is it the accumulated tombstones on a row that make it act as if it were “wide”? Does cfhistograms count the tombstones or subtract them when reporting the cell count for rows? (I don’t know.)

-- Jack Krupansky

From: Keith Wright 
Sent: Friday, July 25, 2014 10:24 AM
To: user@cassandra.apache.org 
Cc: Don Jackson 
Subject: Re: Hot, large row

Ha, check out who filed that ticket!   Yes, I’m aware of it.  My hope is that it was mostly addressed in CASSANDRA-6563, so I may upgrade from 2.0.6 to 2.0.9.  I’m really just surprised that others are not doing similar things to what I am and thus experiencing similar issues.

To answer DuyHai’s questions:

How many nodes do you have? And roughly how many distinct user_ids are there?
- 14 nodes with approximately 250 million distinct user_ids

For GC activity, in general we see low GC pressure in both ParNew and CMS (we see the occasional CMS spike but it’s usually under 100 ms).  When we see a node locked up in CMS GC, it’s not that any one GC takes a long time; it’s just that their constant succession causes the read latency to spike from the usual 3-5 ms up to 35 ms, which causes issues for our application.

Also, Jack Krupansky’s question is interesting. Even though you limit a request to 5000, if each cell is a big blob or block of text, it may add up to a lot in the JVM heap …
- The column values are actually timestamps and thus not variable in length, and we cap the length of the other columns used in the primary key, so I find it VERY unlikely that this is the cause.

I will look into the paging option with that native client, but from the docs it appears that it’s enabled by default, right?

I greatly appreciate all the help!

From: Ken Hancock <ke...@schange.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Friday, July 25, 2014 at 10:06 AM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Cc: Don Jackson <dj...@nanigans.com>
Subject: Re: Hot, large row


https://issues.apache.org/jira/browse/CASSANDRA-6654 

Re: Hot, large row

Posted by Keith Wright <kw...@nanigans.com>.
Ha, check out who filed that ticket!   Yes, I’m aware of it.  My hope is that it was mostly addressed in CASSANDRA-6563, so I may upgrade from 2.0.6 to 2.0.9.  I’m really just surprised that others are not doing similar things to what I am and thus experiencing similar issues.

To answer DuyHai’s questions:

How many nodes do you have? And roughly how many distinct user_ids are there?
- 14 nodes with approximately 250 million distinct user_ids

For GC activity, in general we see low GC pressure in both ParNew and CMS (we see the occasional CMS spike but it’s usually under 100 ms).  When we see a node locked up in CMS GC, it’s not that any one GC takes a long time; it’s just that their constant succession causes the read latency to spike from the usual 3-5 ms up to 35 ms, which causes issues for our application.

Also, Jack Krupansky’s question is interesting. Even though you limit a request to 5000, if each cell is a big blob or block of text, it may add up to a lot in the JVM heap …
- The column values are actually timestamps and thus not variable in length, and we cap the length of the other columns used in the primary key, so I find it VERY unlikely that this is the cause.

I will look into the paging option with that native client, but from the docs it appears that it’s enabled by default, right?
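
For what it’s worth, my reading of the driver docs is that automatic paging is indeed on by default when speaking native protocol v2 to 2.0, with a default fetch size of 5000 (which happens to match our LIMIT).  A rough sketch of how I understand the knobs, with 500 as an example value only:

// Sketch of the DataStax Java driver 2.x paging knobs; not production code.
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

import java.util.UUID;

public class PagingSketch {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")                            // example contact point
            .withQueryOptions(new QueryOptions().setFetchSize(500))  // cluster-wide default page size
            .build();
        Session session = cluster.connect("users");

        PreparedStatement ps = session.prepare(
            "SELECT * FROM global_user_event_skus_v2 WHERE user_id = ?");
        BoundStatement bound = ps.bind(
            UUID.fromString("00000000-0000-1000-8000-000000000000")); // placeholder user_id
        bound.setFetchSize(500);                                      // per-statement override

        // Iterating fetches subsequent pages transparently instead of pulling
        // the whole partition into the coordinator and client at once.
        for (Row row : session.execute(bound)) {
            // process row
        }
        cluster.close();
    }
}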

I greatly appreciate all the help!

From: Ken Hancock <ke...@schange.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Friday, July 25, 2014 at 10:06 AM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Cc: Don Jackson <dj...@nanigans.com>
Subject: Re: Hot, large row

https://issues.apache.org/jira/browse/CASSANDRA-6654

Re: Hot, large row

Posted by Ken Hancock <ke...@schange.com>.
Keith,

If I'm understanding your schema, it sounds like you have rows/partitions
with TTLs that continuously grow over time.  This sounds a lot like
https://issues.apache.org/jira/browse/CASSANDRA-6654, which was marked as
working as designed, but is totally unintuitive.  (Jonathan Ellis went off
and spent the time to reproduce it, which indicates to me it was equally
unintuitive to someone who is pretty knowledgeable about Cassandra.)

Ken





On Thu, Jul 24, 2014 at 10:10 PM, Keith Wright <kw...@nanigans.com> wrote:

> One last item to add to this thread:  we have consistently experienced
> this behavior where over time performance degrades (previously we were
> unable to bootstrap nodes due to long GC pauses from existing nodes).  I
> believe it’s due to tombstone build-up (as I mentioned previously, one of the
> tables mentioned is showing a droppable tombstone ratio of > 30%).   The
> sku table is used to hold SKUs that recent users viewed/purchased.  When we
> write a new SKU we set the TTL to 30 days where the row key is the user id;
> our read case is to fetch ALL skus the user has seen within the TTL.  Since
> the user sees SKUs consistently over time, this can result in a row with
> many columns, many of which are likely tombstoned (see CASSANDRA-6654, which
> I filed for this, and which shows that C* does not handle this case well).
>
> I guess I’m just surprised that others aren’t using C* for similar usage
> cases and thus having the same issue?
>
>  I am hoping to upgrade to 2.0.9 which has improvements to remove
> tombstones.
>
> From: Keith Wright <kw...@nanigans.com>
> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Date: Thursday, July 24, 2014 at 4:50 PM
> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Cc: Don Jackson <dj...@nanigans.com>
>
> Subject: Re: Hot, large row
>
> When a node is showing the high CMS issue, I/O is actually low, likely
> because none is going on during CMS GC.  On a node not showing the
> issue, iostat shows disk usage around 50% (these are SSDs) and load hovers
> around 10, which is fine for a dual octo-core machine.
>
> In addition, nodetool compactionstats does not show that we are falling
> behind in compactions.
>
> So not sure what is going on here.  We are running CentOS 6.5 with java
> 1.7.0_51.  It does seem like things are getting worse and I’m considering
> dropping and rebuilding all the tables (as I have the data in Hadoop).
>  This seems to be a repeated problem for us with Cassandra and now that
> Aerospike has an open source version, we are very much considering
> switching.
>
> Thanks again for the help and any insight you might have!
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>
>           23.43   12.40   11.40    6.20    0.00   46.57
>
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz
> avgqu-sz   await  svctm  %util
>
> sda               0.00     0.00  720.50    0.00     5.51     0.00    15.66
>     1.05    1.54   0.57  41.35
>
> sdc               0.00     0.00 5930.50 2100.50    55.42     8.21    16.23
>     2.30    0.29   0.06  51.25
>
> sdb               0.00     0.00 6951.50 2052.50    65.82     8.02    16.80
>     4.31    0.48   0.07  59.60
>
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>
>            9.48   14.72    5.60    3.67    0.00   66.52
>
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz
> avgqu-sz   await  svctm  %util
>
> sda               0.00     0.00  327.00   19.50     2.55     0.08    15.53
>     0.04    0.13   0.12   4.25
>
> sdc               0.00     0.00 3974.50 1403.50    36.37     5.48    15.94
>     0.99    0.18   0.08  45.10
>
> sdb               0.00     0.00 4357.50 1535.00    40.83     6.00    16.28
>     1.10    0.19   0.08  47.45
>
> From: DuyHai Doan <do...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Date: Thursday, July 24, 2014 at 4:44 PM
> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Subject: Re: Hot, large row
>
> For global_user_event_skus_v2
>
> 1. The number of SSTables per read is quite huge. Considering you're using
> LCS, it means that LCS cannot keep up with the write rate and is falling
> behind. AFAIK LCS uses SizeTieredCompaction at L0 to cope with extreme write
> bursts. Your high number of SSTables per read is quite fishy here.
>
> 2. Write latency is spread out, up to 73.457 millisecs, meaning that your
> node is falling behind on writes in some cases. Most writes are still below
> 1 millisec, but we don't care about those. What we care about here is the
> long tail of write latency climbing up to 73 millisecs.
>
> 3. Same remarks for read latency, which is worse because the distribution
> is even "flatter", with worst cases going up to 100 ms.
>
> If I were you, I'd check disk I/O first and maybe CPU usage.
>
>
> On Thu, Jul 24, 2014 at 10:32 PM, Keith Wright <kw...@nanigans.com>
> wrote:
>
>> Cfhistograms for the tables I believe are most likely the issue are below
>> on the node that most recently presented the issue.  Any ideas?  Note that
>> these tables are LCS and have droppable tombstone ratios of 27% for
>> global_user_event_skus_v2 and 2.7% for the other.  Table definitions also
>> below, you can see that they do use compound keys.
>>
>> Thanks
>>
>> CREATE TABLE global_user_event_skus_v2 (
>>
>>   user_id timeuuid,
>>
>>   app_id int,
>>
>>   type text,
>>
>>   name text,
>>
>>   sku text,
>>
>>   creationtime timestamp,
>>
>>   PRIMARY KEY (user_id, app_id, type, name, sku)
>>
>> ) WITH
>>
>>   bloom_filter_fp_chance=0.100000 AND
>>
>>   caching='KEYS_ONLY' AND
>>
>>   comment='' AND
>>
>>   dclocal_read_repair_chance=0.000000 AND
>>
>>   gc_grace_seconds=86400 AND
>>
>>   index_interval=128 AND
>>
>>   read_repair_chance=0.100000 AND
>>
>>   replicate_on_write='true' AND
>>
>>   populate_io_cache_on_flush='false' AND
>>
>>   default_time_to_live=2592000 AND
>>
>>   speculative_retry='99.0PERCENTILE' AND
>>
>>   memtable_flush_period_in_ms=0 AND
>>
>>   compaction={'class': 'LeveledCompactionStrategy'} AND
>>
>>   compression={'sstable_compression': 'LZ4Compressor'};
>>
>>
>> users/global_user_event_skus_v2 histograms
>>
>>
>> SSTables per Read
>>
>>  1 sstables: 1398122
>>
>>  2 sstables: 231100
>>
>>  3 sstables: 97334
>>
>>  4 sstables: 56496
>>
>>  5 sstables: 43046
>>
>>  6 sstables: 35471
>>
>>  7 sstables: 26888
>>
>>  8 sstables: 21102
>>
>> 10 sstables: 29522
>>
>> 12 sstables: 18314
>>
>> 14 sstables: 11368
>>
>> 17 sstables: 10283
>>
>> 20 sstables: 5766
>>
>> 24 sstables: 3554
>>
>> 29 sstables: 1681
>>
>> 35 sstables: 521
>>
>> 42 sstables: 102
>>
>> 50 sstables: 1
>>
>>
>> Write Latency (microseconds)
>>
>>      2 us: 532
>>
>>      3 us: 10728
>>
>>      4 us: 49608
>>
>>      5 us: 173718
>>
>>      6 us: 404131
>>
>>      7 us: 697468
>>
>>      8 us: 981048
>>
>>     10 us: 2643680
>>
>>     12 us: 3183727
>>
>>     14 us: 3150881
>>
>>     17 us: 3746333
>>
>>     20 us: 2298481
>>
>>     24 us: 1647604
>>
>>     29 us: 1070733
>>
>>     35 us: 693300
>>
>>     42 us: 442120
>>
>>     50 us: 288618
>>
>>     60 us: 205061
>>
>>     72 us: 141194
>>
>>     86 us: 98781
>>
>>    103 us: 73629
>>
>>    124 us: 54900
>>
>>    149 us: 40518
>>
>>    179 us: 30330
>>
>>    215 us: 23953
>>
>>    258 us: 18875
>>
>>    310 us: 14147
>>
>>    372 us: 10509
>>
>>    446 us: 7984
>>
>>    535 us: 6257
>>
>>    642 us: 4903
>>
>>    770 us: 4086
>>
>>    924 us: 3252
>>
>>   1109 us: 2676
>>
>>   1331 us: 2129
>>
>>   1597 us: 1753
>>
>>   1916 us: 1488
>>
>>   2299 us: 1237
>>
>>   2759 us: 1078
>>
>>   3311 us: 866
>>
>>   3973 us: 704
>>
>>   4768 us: 551
>>
>>   5722 us: 553
>>
>>   6866 us: 774
>>
>>   8239 us: 443
>>
>>   9887 us: 303
>>
>>  11864 us: 222
>>
>>  14237 us: 181
>>
>>  17084 us: 146
>>
>>  20501 us: 147
>>
>>  24601 us: 499
>>
>>  29521 us: 842
>>
>>  35425 us: 548
>>
>>  42510 us: 303
>>
>>  51012 us: 560
>>
>>  61214 us: 842
>>
>>  73457 us: 600
>>
>>  88148 us: 283
>>
>> 105778 us: 173
>>
>> 126934 us: 81
>>
>> 152321 us: 18
>>
>> 182785 us: 2
>>
>>
>> Read Latency (microseconds)
>>
>>      35 us: 2
>>
>>      42 us: 72
>>
>>      50 us: 737
>>
>>      60 us: 3555
>>
>>      72 us: 6926
>>
>>      86 us: 8373
>>
>>     103 us: 15287
>>
>>     124 us: 30054
>>
>>     149 us: 46164
>>
>>     179 us: 67828
>>
>>     215 us: 86741
>>
>>     258 us: 92178
>>
>>     310 us: 82205
>>
>>     372 us: 69641
>>
>>     446 us: 64710
>>
>>     535 us: 58295
>>
>>     642 us: 55247
>>
>>     770 us: 57918
>>
>>     924 us: 56928
>>
>>    1109 us: 75762
>>
>>    1331 us: 122306
>>
>>    1597 us: 141564
>>
>>    1916 us: 117301
>>
>>    2299 us: 102293
>>
>>    2759 us: 98789
>>
>>    3311 us: 86923
>>
>>    3973 us: 75304
>>
>>    4768 us: 64012
>>
>>    5722 us: 54973
>>
>>    6866 us: 47221
>>
>>    8239 us: 41483
>>
>>    9887 us: 35015
>>
>>   11864 us: 30249
>>
>>   14237 us: 23505
>>
>>   17084 us: 18364
>>
>>   20501 us: 12992
>>
>>   24601 us: 8837
>>
>>   29521 us: 6497
>>
>>   35425 us: 4840
>>
>>   42510 us: 3536
>>
>>   51012 us: 2471
>>
>>   61214 us: 2784
>>
>>   73457 us: 3664
>>
>>   88148 us: 3219
>>
>>  105778 us: 1980
>>
>>  126934 us: 1050
>>
>>  152321 us: 519
>>
>>  182785 us: 184
>>
>>  219342 us: 82
>>
>>  263210 us: 39
>>
>>  315852 us: 22
>>
>>  379022 us: 16
>>
>>  454826 us: 0
>>
>>  545791 us: 0
>>
>>  654949 us: 1
>>
>>  785939 us: 1
>>
>>  943127 us: 0
>>
>> 1131752 us: 0
>>
>> 1358102 us: 0
>>
>> 1629722 us: 3
>>
>> 1955666 us: 3
>>
>> 2346799 us: 1
>>
>> 2816159 us: 0
>>
>> 3379391 us: 0
>>
>> 4055269 us: 0
>>
>> 4866323 us: 0
>>
>> 5839588 us: 0
>>
>> 7007506 us: 4
>>
>>
>> Partition Size (bytes)
>>
>>      35 bytes: 31
>>
>>      42 bytes: 0
>>
>>      50 bytes: 0
>>
>>      60 bytes: 0
>>
>>      72 bytes: 0
>>
>>      86 bytes: 0
>>
>>     103 bytes: 761108
>>
>>     124 bytes: 23596164
>>
>>     149 bytes: 22963275
>>
>>     179 bytes: 3517971
>>
>>     215 bytes: 6395335
>>
>>     258 bytes: 6197982
>>
>>     310 bytes: 4662589
>>
>>     372 bytes: 5261934
>>
>>     446 bytes: 3032032
>>
>>     535 bytes: 3775310
>>
>>     642 bytes: 2879935
>>
>>     770 bytes: 3002743
>>
>>     924 bytes: 2279900
>>
>>    1109 bytes: 1976357
>>
>>    1331 bytes: 1643875
>>
>>    1597 bytes: 1422199
>>
>>    1916 bytes: 1142746
>>
>>    2299 bytes: 901584
>>
>>    2759 bytes: 760173
>>
>>    3311 bytes: 596874
>>
>>    3973 bytes: 480502
>>
>>    4768 bytes: 380185
>>
>>    5722 bytes: 303315
>>
>>    6866 bytes: 239013
>>
>>    8239 bytes: 187337
>>
>>    9887 bytes: 147218
>>
>>   11864 bytes: 115684
>>
>>   14237 bytes: 89718
>>
>>   17084 bytes: 69160
>>
>>   20501 bytes: 53070
>>
>>   24601 bytes: 40496
>>
>>   29521 bytes: 30344
>>
>>   35425 bytes: 22656
>>
>>   42510 bytes: 16419
>>
>>   51012 bytes: 11621
>>
>>   61214 bytes: 8020
>>
>>   73457 bytes: 5561
>>
>>   88148 bytes: 3708
>>
>>  105778 bytes: 2434
>>
>>  126934 bytes: 1561
>>
>>  152321 bytes: 1017
>>
>>  182785 bytes: 559
>>
>>  219342 bytes: 349
>>
>>  263210 bytes: 181
>>
>>  315852 bytes: 121
>>
>>  379022 bytes: 66
>>
>>  454826 bytes: 22
>>
>>  545791 bytes: 14
>>
>>  654949 bytes: 10
>>
>>  785939 bytes: 2
>>
>>  943127 bytes: 0
>>
>> 1131752 bytes: 4
>>
>> 1358102 bytes: 2
>>
>> 1629722 bytes: 1
>>
>> 1955666 bytes: 2
>>
>> 2346799 bytes: 0
>>
>> 2816159 bytes: 0
>>
>> 3379391 bytes: 0
>>
>> 4055269 bytes: 0
>>
>> 4866323 bytes: 1
>>
>>
>> Cell Count per Partition
>>
>>     1 cells: 50599458
>>
>>     2 cells: 13975363
>>
>>     3 cells: 7927898
>>
>>     4 cells: 4981142
>>
>>     5 cells: 3081166
>>
>>     6 cells: 3034895
>>
>>     7 cells: 1896720
>>
>>     8 cells: 1508849
>>
>>    10 cells: 2418845
>>
>>    12 cells: 1631289
>>
>>    14 cells: 1179269
>>
>>    17 cells: 1301647
>>
>>    20 cells: 919744
>>
>>    24 cells: 864313
>>
>>    29 cells: 735203
>>
>>    35 cells: 597749
>>
>>    42 cells: 469611
>>
>>    50 cells: 367422
>>
>>    60 cells: 312697
>>
>>    72 cells: 251782
>>
>>    86 cells: 197717
>>
>>   103 cells: 161244
>>
>>   124 cells: 132074
>>
>>   149 cells: 103711
>>
>>   179 cells: 81760
>>
>>   215 cells: 63360
>>
>>   258 cells: 48982
>>
>>   310 cells: 37932
>>
>>   372 cells: 28328
>>
>>   446 cells: 21347
>>
>>   535 cells: 15373
>>
>>   642 cells: 10959
>>
>>   770 cells: 7643
>>
>>   924 cells: 5346
>>
>>  1109 cells: 3613
>>
>>  1331 cells: 2291
>>
>>  1597 cells: 1558
>>
>>  1916 cells: 885
>>
>>  2299 cells: 577
>>
>>  2759 cells: 327
>>
>>  3311 cells: 169
>>
>>  3973 cells: 116
>>
>>  4768 cells: 57
>>
>>  5722 cells: 23
>>
>>  6866 cells: 15
>>
>>  8239 cells: 9
>>
>>  9887 cells: 3
>>
>> 11864 cells: 0
>>
>> 14237 cells: 5
>>
>> 17084 cells: 1
>>
>> 20501 cells: 0
>>
>> 24601 cells: 2
>>
>> 29521 cells: 0
>>
>> 35425 cells: 0
>>
>> 42510 cells: 0
>>
>> 51012 cells: 0
>>
>> 61214 cells: 1
>>
>>
>> CREATE TABLE global_user_events (
>>
>>   user_id timeuuid,
>>
>>   app_id int,
>>
>>   type text,
>>
>>   name text,
>>
>>   extra_param map<text, text>,
>>
>>   paid boolean,
>>
>>   value float,
>>
>>   PRIMARY KEY (user_id, app_id, type, name)
>>
>> ) WITH
>>
>>   bloom_filter_fp_chance=0.100000 AND
>>
>>   caching='KEYS_ONLY' AND
>>
>>   comment='' AND
>>
>>   dclocal_read_repair_chance=0.000000 AND
>>
>>   gc_grace_seconds=86400 AND
>>
>>   index_interval=128 AND
>>
>>   read_repair_chance=0.100000 AND
>>
>>   replicate_on_write='true' AND
>>
>>   populate_io_cache_on_flush='false' AND
>>
>>   default_time_to_live=0 AND
>>
>>   speculative_retry='99.0PERCENTILE' AND
>>
>>   memtable_flush_period_in_ms=0 AND
>>
>>   compaction={'class': 'LeveledCompactionStrategy'} AND
>>
>>   compression={'sstable_compression': 'LZ4Compressor'};
>>
>>
>> users/global_user_events histograms
>>
>>
>> SSTables per Read
>>
>>  1 sstables: 1024446
>>
>>  2 sstables: 632948
>>
>>  3 sstables: 230768
>>
>>  4 sstables: 112428
>>
>>  5 sstables: 82651
>>
>>  6 sstables: 61439
>>
>>  7 sstables: 47518
>>
>>  8 sstables: 38490
>>
>> 10 sstables: 58054
>>
>> 12 sstables: 41128
>>
>> 14 sstables: 29642
>>
>> 17 sstables: 31056
>>
>> 20 sstables: 20474
>>
>> 24 sstables: 17799
>>
>> 29 sstables: 13744
>>
>> 35 sstables: 9485
>>
>> 42 sstables: 5933
>>
>> 50 sstables: 3607
>>
>> 60 sstables: 1618
>>
>> 72 sstables: 263
>>
>> 86 sstables: 10
>>
>>
>> Write Latency (microseconds)
>>
>>      10 us: 10
>>
>>      12 us: 75
>>
>>      14 us: 319
>>
>>      17 us: 3550
>>
>>      20 us: 20617
>>
>>      24 us: 105254
>>
>>      29 us: 311290
>>
>>      35 us: 494404
>>
>>      42 us: 516301
>>
>>      50 us: 534061
>>
>>      60 us: 421438
>>
>>      72 us: 314377
>>
>>      86 us: 375240
>>
>>     103 us: 333402
>>
>>     124 us: 816891
>>
>>     149 us: 667832
>>
>>     179 us: 430659
>>
>>     215 us: 336063
>>
>>     258 us: 380606
>>
>>     310 us: 235592
>>
>>     372 us: 68679
>>
>>     446 us: 14988
>>
>>     535 us: 6169
>>
>>     642 us: 3763
>>
>>     770 us: 2591
>>
>>     924 us: 2270
>>
>>    1109 us: 1985
>>
>>    1331 us: 1744
>>
>>    1597 us: 1564
>>
>>    1916 us: 1489
>>
>>    2299 us: 1326
>>
>>    2759 us: 1114
>>
>>    3311 us: 1024
>>
>>    3973 us: 984
>>
>>    4768 us: 833
>>
>>    5722 us: 962
>>
>>    6866 us: 1076
>>
>>    8239 us: 891
>>
>>    9887 us: 611
>>
>>   11864 us: 489
>>
>>   14237 us: 324
>>
>>   17084 us: 306
>>
>>   20501 us: 222
>>
>>   24601 us: 426
>>
>>   29521 us: 791
>>
>>   35425 us: 820
>>
>>   42510 us: 534
>>
>>   51012 us: 530
>>
>>   61214 us: 946
>>
>>   73457 us: 1122
>>
>>   88148 us: 714
>>
>>  105778 us: 313
>>
>>  126934 us: 127
>>
>>  152321 us: 38
>>
>>  182785 us: 3
>>
>>  219342 us: 2
>>
>>  263210 us: 0
>>
>>  315852 us: 0
>>
>>  379022 us: 0
>>
>>  454826 us: 2
>>
>>  545791 us: 0
>>
>>  654949 us: 3
>>
>>  785939 us: 0
>>
>>  943127 us: 0
>>
>> 1131752 us: 0
>>
>> 1358102 us: 0
>>
>> 1629722 us: 0
>>
>> 1955666 us: 0
>>
>> 2346799 us: 1
>>
>> 2816159 us: 0
>>
>> 3379391 us: 0
>>
>> 4055269 us: 0
>>
>> 4866323 us: 0
>>
>> 5839588 us: 0
>>
>> 7007506 us: 1
>>
>>
>> Read Latency (microseconds)
>>
>>      42 us: 1
>>
>>      50 us: 2
>>
>>      60 us: 2
>>
>>      72 us: 4
>>
>>      86 us: 15
>>
>>     103 us: 23
>>
>>     124 us: 29
>>
>>     149 us: 48
>>
>>     179 us: 67
>>
>>     215 us: 80
>>
>>     258 us: 316
>>
>>     310 us: 1812
>>
>>     372 us: 4999
>>
>>     446 us: 8876
>>
>>     535 us: 14586
>>
>>     642 us: 24417
>>
>>     770 us: 31943
>>
>>     924 us: 37453
>>
>>    1109 us: 77449
>>
>>    1331 us: 139135
>>
>>    1597 us: 177538
>>
>>    1916 us: 169568
>>
>>    2299 us: 177617
>>
>>    2759 us: 202552
>>
>>    3311 us: 198101
>>
>>    3973 us: 179807
>>
>>    4768 us: 159149
>>
>>    5722 us: 142282
>>
>>    6866 us: 126124
>>
>>    8239 us: 111529
>>
>>    9887 us: 97658
>>
>>   11864 us: 84913
>>
>>   14237 us: 70905
>>
>>   17084 us: 54917
>>
>>   20501 us: 40623
>>
>>   24601 us: 29545
>>
>>   29521 us: 21448
>>
>>   35425 us: 15255
>>
>>   42510 us: 11222
>>
>>   51012 us: 8150
>>
>>   61214 us: 7473
>>
>>   73457 us: 10060
>>
>>   88148 us: 10530
>>
>>  105778 us: 7573
>>
>>  126934 us: 4331
>>
>>  152321 us: 2097
>>
>>  182785 us: 836
>>
>>  219342 us: 255
>>
>>  263210 us: 84
>>
>>  315852 us: 28
>>
>>  379022 us: 8
>>
>>  454826 us: 2
>>
>>  545791 us: 0
>>
>>  654949 us: 2
>>
>>  785939 us: 1
>>
>>  943127 us: 0
>>
>> 1131752 us: 0
>>
>> 1358102 us: 0
>>
>> 1629722 us: 2
>>
>> 1955666 us: 6
>>
>> 2346799 us: 18
>>
>> 2816159 us: 0
>>
>> 3379391 us: 0
>>
>> 4055269 us: 0
>>
>> 4866323 us: 0
>>
>> 5839588 us: 0
>>
>> 7007506 us: 34
>>
>>
>> Partition Size (bytes)
>>
>>     35 bytes: 25
>>
>>     42 bytes: 0
>>
>>     50 bytes: 0
>>
>>     60 bytes: 0
>>
>>     72 bytes: 21761
>>
>>     86 bytes: 694466658
>>
>>    103 bytes: 219007195
>>
>>    124 bytes: 7034596
>>
>>    149 bytes: 110269079
>>
>>    179 bytes: 101276811
>>
>>    215 bytes: 38518027
>>
>>    258 bytes: 103242356
>>
>>    310 bytes: 51434914
>>
>>    372 bytes: 38048245
>>
>>    446 bytes: 42558831
>>
>>    535 bytes: 32869176
>>
>>    642 bytes: 17188026
>>
>>    770 bytes: 14750559
>>
>>    924 bytes: 15963256
>>
>>   1109 bytes: 16125324
>>
>>   1331 bytes: 11462272
>>
>>   1597 bytes: 6810452
>>
>>   1916 bytes: 5685739
>>
>>   2299 bytes: 10232324
>>
>>   2759 bytes: 5138522
>>
>>   3311 bytes: 5847998
>>
>>   3973 bytes: 4065752
>>
>>   4768 bytes: 3468263
>>
>>   5722 bytes: 2545431
>>
>>   6866 bytes: 1701913
>>
>>   8239 bytes: 993808
>>
>>   9887 bytes: 495282
>>
>>  11864 bytes: 208644
>>
>>  14237 bytes: 74307
>>
>>  17084 bytes: 22286
>>
>>  20501 bytes: 5474
>>
>>  24601 bytes: 1225
>>
>>  29521 bytes: 210
>>
>>  35425 bytes: 97
>>
>>  42510 bytes: 48
>>
>>  51012 bytes: 21
>>
>>  61214 bytes: 15
>>
>>  73457 bytes: 10
>>
>>  88148 bytes: 3
>>
>> 105778 bytes: 4
>>
>> 126934 bytes: 1
>>
>> 152321 bytes: 2
>>
>> 182785 bytes: 1
>>
>>
>> Cell Count per Partition
>>
>>    1 cells: 914323128
>>
>>    2 cells: 196020828
>>
>>    3 cells: 165502164
>>
>>    4 cells: 71735648
>>
>>    5 cells: 54450466
>>
>>    6 cells: 28995756
>>
>>    7 cells: 9625976
>>
>>    8 cells: 10257788
>>
>>   10 cells: 24004757
>>
>>   12 cells: 13087147
>>
>>   14 cells: 13588347
>>
>>   17 cells: 11976353
>>
>>   20 cells: 6245374
>>
>>   24 cells: 5627821
>>
>>   29 cells: 9480533
>>
>>   35 cells: 6008276
>>
>>   42 cells: 5221357
>>
>>   50 cells: 4355454
>>
>>   60 cells: 3955488
>>
>>   72 cells: 2872578
>>
>>   86 cells: 1914447
>>
>>  103 cells: 1216276
>>
>>  124 cells: 652035
>>
>>  149 cells: 279565
>>
>>  179 cells: 99836
>>
>>  215 cells: 29132
>>
>>  258 cells: 6682
>>
>>  310 cells: 1276
>>
>>  372 cells: 258
>>
>>  446 cells: 96
>>
>>  535 cells: 48
>>
>>  642 cells: 25
>>
>>  770 cells: 9
>>
>>  924 cells: 9
>>
>> 1109 cells: 4
>>
>> 1331 cells: 3
>>
>> 1597 cells: 1
>>
>> 1916 cells: 2
>>
>> From: Jack Krupansky <ja...@basetechnology.com>
>> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>> Date: Thursday, July 24, 2014 at 4:07 PM
>>
>> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>> Subject: Re: Hot, large row
>>
>> Could it be some “fat columns” (cells with large blob or text values)
>> rather than the cell-count per se? IOW, a “big row” rather than a “wide
>> row”?
>>
>> And, could it be a large partition rather than a large row (many rows in
>> a single partition)? Are clustering columns being used in the primary key?
>>
>> -- Jack Krupansky
>>
>> *From:* DuyHai Doan <do...@gmail.com>
>> *Sent:* Thursday, July 24, 2014 3:53 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Hot, large row
>>
>> Your extract of cfhistograms show that there are no particular "wide
>> rows". The widest has 61214 cells which is big but not that huge to be
>> really a concern.
>>
>> Turning on trace probabilty only tells give you some "hints" about what
>> kind of queries are done, it does not give the exact partition key nor
>> other statement values, especially when you are using prepared statements
>> ...
>>
>>
>> "I am considering reducing read_request_timeout_in_ms: 5000 in
>> cassandra.yaml so that it reduces the impact when this occurs." --> Don't
>> do that, you'll only sweep dust under the carpet. Find the real issue and
>> fix it instead of changing parameter to hide it.
>>
>> One solution would be on client side, to activate some logging to show
>> the CQL3 statements the application is issuing that may overload the
>> server.  I know that's better said than done but I don't have any other
>> idea for the moment
>>
>> -------- Shameless self-promotion ------
>>
>> To support this kind of live prod debugging & investigation that I add a
>> new dynamic query logging feature in Achilles:
>> https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging
>>
>> Once you hit the issue, this kind of feature may save your day...
>>
>>
>>
>>
>>
>>
>> On Thu, Jul 24, 2014 at 9:22 PM, Keith Wright <kw...@nanigans.com>
>> wrote:
>>
>>> I can see from cfhistograms that I do have some wide rows (see below).
>>> I set trace probability as you suggested but the output doesn’t appear to
>>> tell me what row was actually read unless I missed something.  I just see
>>> executing prepared statement.   Any ideas how I can find the row in
>>> question?
>>>
>>> I am considering reducing read_request_timeout_in_ms: 5000 in
>>> cassandra.yaml so that it reduces the impact when this occurs.
>>>
>>> Any help in identifying my issue would be GREATLY appreciated
>>>
>>>
>>> Cell Count per Partition
>>>
>>>     1 cells: 50449950
>>>
>>>     2 cells: 14281828
>>>
>>>     3 cells: 8093366
>>>
>>>     4 cells: 5029200
>>>
>>>     5 cells: 3103023
>>>
>>>     6 cells: 3059903
>>>
>>>     7 cells: 1903018
>>>
>>>     8 cells: 1509297
>>>
>>>    10 cells: 2420359
>>>
>>>    12 cells: 1624895
>>>
>>>    14 cells: 1171678
>>>
>>>    17 cells: 1289391
>>>
>>>    20 cells: 909777
>>>
>>>    24 cells: 852081
>>>
>>>    29 cells: 722925
>>>
>>>    35 cells: 587067
>>>
>>>    42 cells: 459473
>>>
>>>    50 cells: 358744
>>>
>>>    60 cells: 304146
>>>
>>>    72 cells: 244682
>>>
>>>    86 cells: 191045
>>>
>>>   103 cells: 155337
>>>
>>>   124 cells: 127061
>>>
>>>   149 cells: 98913
>>>
>>>   179 cells: 77454
>>>
>>>   215 cells: 59849
>>>
>>>   258 cells: 46117
>>>
>>>   310 cells: 35321
>>>
>>>   372 cells: 26319
>>>
>>>   446 cells: 19379
>>>
>>>   535 cells: 13783
>>>
>>>   642 cells: 9993
>>>
>>>   770 cells: 6973
>>>
>>>   924 cells: 4713
>>>
>>> 1109 cells: 3229
>>>
>>> 1331 cells: 2062
>>>
>>> 1597 cells: 1338
>>>
>>> 1916 cells: 773
>>>
>>> 2299 cells: 495
>>>
>>> 2759 cells: 268
>>>
>>> 3311 cells: 150
>>>
>>> 3973 cells: 100
>>>
>>> 4768 cells: 42
>>>
>>> 5722 cells: 24
>>>
>>> 6866 cells: 12
>>>
>>> 8239 cells: 9
>>>
>>> 9887 cells: 3
>>>
>>> 11864 cells: 0
>>>
>>> 14237 cells: 5
>>>
>>> 17084 cells: 1
>>>
>>> 20501 cells: 0
>>>
>>> 24601 cells: 2
>>>
>>> 29521 cells: 0
>>>
>>> 35425 cells: 0
>>>
>>> 42510 cells: 0
>>>
>>> 51012 cells: 0
>>>
>>> 61214 cells: 2
>>>
>>> From: DuyHai Doan <do...@gmail.com>
>>> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>>> Date: Thursday, July 24, 2014 at 3:01 PM
>>> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>>> Subject: Re: Hot, large row
>>>
>>> "How can I detect wide rows?" -->
>>>
>>> nodetool cfhistograms <keyspace> <suspected column family>
>>>
>>> Look at column "Column count" (last column) and identify a line in this
>>> column with very high value of "Offset". In a well designed application you
>>> should have a gaussian distribution where 80% of your row have a similar
>>> number of columns.
>>>
>>> "Anyone know what debug level I can set so that I can see what reads the
>>> hot node is handling?  " -->
>>>
>>> "nodetool settraceprobability <value>",  where value is a small number
>>> (0.001) on the node where you encounter the issue. Activate the tracing for
>>> a while (5 mins) and deactivate it (value = 0). Then look into
>>> system_traces tables "events" & "sessions". It may help or not since the
>>> tracing is done once every 1000.
>>>
>>> "Any way to get the server to blacklist these wide rows automatically?"
>>> --> No
>>>
>>>
>>> On Thu, Jul 24, 2014 at 8:48 PM, Keith Wright <kw...@nanigans.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>>    We are seeing an issue where basically daily one of our nodes spikes
>>>> in load and is churning in CMS heap pressure.  It appears that reads are
>>>> backing up and my guess is that our application is reading a large row
>>>> repeatedly.  Our write structure can lead itself to wide rows very
>>>> infrequently (<0.001%) and we do our best to detect and delete them but
>>>> obviously we’re missing a case.  Hoping for assistance on the following
>>>> questions:
>>>>
>>>>    - How can I detect wide rows?
>>>>    - Anyone know what debug level I can set so that I can see what
>>>>    reads the hot node is handling?  I’m hoping to see the “bad” row
>>>>    - Any way to get the server to blacklist these wide rows
>>>>    automatically?
>>>>
>>>> We’re using C* 2.0.6 with Vnodes.
>>>>
>>>> Thanks
>>>>
>>>
>>>
>>
>>
>
>


-- 
Ken Hancock | System Architect, Advanced Advertising
SeaChange International
50 Nagog Park
Acton, Massachusetts 01720
ken.hancock@schange.com | www.schange.com | NASDAQ:SEAC
<http://www.schange.com/en-US/Company/InvestorRelations.aspx>
Office: +1 (978) 889-3329 | Google Talk: ken.hancock@schange.com |
Skype: hancockks | Yahoo IM: hancockks | LinkedIn:
<http://www.linkedin.com/in/kenhancock>

<http://www.schange.com/>
This e-mail and any attachments may contain information which is SeaChange
International confidential. The information enclosed is intended only for the
addressees herein and may not be copied or forwarded without permission from
SeaChange International.

Re: Hot, large row

Posted by DuyHai Doan <do...@gmail.com>.
Hello Keith


   1. Periodically seeing one node stuck in CMS GC causing high read
   latency.  Seems to recover on its own after an hour or so

How many nodes do you have? And roughly how many distinct user_ids are
there?

Looking at your JVM settings, it seems that you have the GC log enabled. It's
worth having a look at it. Also grep for the pattern "GC for" in the
Cassandra system.log file.

The symptom you mention looks like there are activity bursts on one
particular node. The rows are not so wide, since the largest has only 61k
cells and C* can deal with rows larger than that. It now all depends on
your data access pattern.

Also, Jack Krupansky's question is interesting. Even though you limit a
request to 5000, if each cell is a big blob or block of text, it may add
up to a lot in the JVM heap ...

Did you try to do a select without a limit and use the paging feature of the
Java Driver? Or lower the limit in the select to 500, as Duncan said, and
paginate manually?
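
For the manual option, a rough sketch with the Java Driver (column names taken from your schema; the multi-column clustering restriction "(app_id, type, name, sku) > (?, ?, ?, ?)" should be available in CQL3 on 2.0, but it is worth double-checking on 2.0.6):

// Rough sketch of "lower the limit and paginate manually" with the Java driver.
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

import java.util.UUID;

public class ManualPagingSketch {

    private static final int PAGE_SIZE = 500;

    public static void readAllSkus(Session session, UUID userId) {
        PreparedStatement firstPage = session.prepare(
            "SELECT app_id, type, name, sku, creationtime " +
            "FROM users.global_user_event_skus_v2 WHERE user_id = ? LIMIT 500");
        PreparedStatement nextPage = session.prepare(
            "SELECT app_id, type, name, sku, creationtime " +
            "FROM users.global_user_event_skus_v2 WHERE user_id = ? " +
            "AND (app_id, type, name, sku) > (?, ?, ?, ?) LIMIT 500");

        BoundStatement bs = firstPage.bind(userId);
        while (true) {
            Row last = null;
            int fetched = 0;
            for (Row row : session.execute(bs)) {
                // process the row here ...
                last = row;
                fetched++;
            }
            if (fetched < PAGE_SIZE || last == null) {
                break; // short page means no more data for this user
            }
            // Restart just after the last clustering key seen, to avoid
            // re-reading rows already processed.
            bs = nextPage.bind(userId, last.getInt("app_id"),
                    last.getString("type"), last.getString("name"),
                    last.getString("sku"));
        }
    }
}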

Hope that helps

Duy Hai



On Fri, Jul 25, 2014 at 3:10 PM, Duncan Sands <du...@gmail.com>
wrote:

> Hi Keith,
>
>
> On 25/07/14 14:43, Keith Wright wrote:
>
>> Answers to your questions below but in the end I believe the root issue
>> here is
>> that LCS is clearly not compacting away as it should resulting in reads
>> across
>> many SSTables which as you noted is “fishy”.   I’m considering filing a
>> JIRA for
>> this, sound reasonable?
>>
>> We are running OOTB JMV tuning (see below) and using the datastax client.
>>  When
>> we read from the table in question, we put a limit of 5000 to help reduce
>> the
>> read volume but yes the standard scenario is:  “select * from
>> global_user_event_skus_v2 where user_id = ? limit 5000”
>>
>
> does reducing the limit, eg to 500, help?  I've had similar sounding
> problems when many clients were doing wide row reads in parallel, with the
> reads returning thousands of rows.
>
> Ciao, Duncan.
>

Re: Hot, large row

Posted by Duncan Sands <du...@gmail.com>.
Hi Keith,

On 25/07/14 14:43, Keith Wright wrote:
> Answers to your questions below but in the end I believe the root issue here is
> that LCS is clearly not compacting away as it should resulting in reads across
> many SSTables which as you noted is “fishy”.   I’m considering filing a JIRA for
> this, sound reasonable?
>
> We are running OOTB JMV tuning (see below) and using the datastax client.  When
> we read from the table in question, we put a limit of 5000 to help reduce the
> read volume but yes the standard scenario is:  “select * from
> global_user_event_skus_v2 where user_id = ? limit 5000”

does reducing the limit, e.g. to 500, help?  I've had similar-sounding problems 
when many clients were doing wide row reads in parallel, with the reads 
returning thousands of rows.

Ciao, Duncan.

Re: Hot, large row

Posted by Keith Wright <kw...@nanigans.com>.
Answers to your questions are below, but in the end I believe the root issue here is that LCS is clearly not compacting away as it should, resulting in reads across many SSTables, which as you noted is “fishy”.   I’m considering filing a JIRA for this; sound reasonable?

We are running OOTB JVM tuning (see below) and using the DataStax client.  When we read from the table in question, we put a limit of 5000 to help reduce the read volume, but yes, the standard scenario is:  “select * from global_user_event_skus_v2 where user_id = ? limit 5000”

If you recall, there are 2 issues occurring.  Turning on client-side paging would just mask these IMO:

 1.  Periodically seeing one node stuck in CMS GC causing high read latency.  Seems to recover on its own after an hour or so
 2.  Generally bad read/write times for some LCS tables that have multiple updates over time

Thanks!


-XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8018M -Xmx8018M -Xmn2004M -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:+UseCondCardMark -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintHeapAtGC -XX:+PrintGCApplicationStoppedTime -XX:+PrintGCApplicationConcurrentTime -Xloggc:/var/log/cassandra/gc.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=10M

From: DuyHai Doan <do...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Friday, July 25, 2014 at 2:53 AM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Hot, large row


What are your JVM settings?

Your read pattern implies that you may fetch lots of data into memory (reading all SKUs for a given user); maybe it stresses the JVM too much.

Did you use the native paging of the Java Driver to avoid loading all columns at once?

And loading all SKUs for one user: is that a rare scenario, or is it your main use case for this column family?

On 25 Jul 2014 04:11, "Keith Wright" <kw...@nanigans.com> wrote:
One last item to add to this thread:  we have consistently experienced this behavior where over time performance degrades (previously we were unable to bootstrap nodes due to long GC pauses from existing nodes).  I believe it’s due to tombstone build-up (as I mentioned previously, one of the tables mentioned is showing a droppable tombstone ratio of > 30%).   The sku table is used to hold SKUs that recent users viewed/purchased.  When we write a new SKU we set the TTL to 30 days, where the row key is the user id; our read case is to fetch ALL SKUs the user has seen within the TTL.  Since the user sees SKUs consistently over time, this can result in a row with many columns, many of which are likely tombstoned (see CASSANDRA-6654, which I filed for this and which shows that C* does not handle this case well).

I guess I’m just surprised that others aren’t using C* for similar usage cases and thus having the same issue?

 I am hoping to upgrade to 2.0.9 which has improvements to remove tombstones.

From: Keith Wright <kw...@nanigans.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Thursday, July 24, 2014 at 4:50 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Cc: Don Jackson <dj...@nanigans.com>
Subject: Re: Hot, large row

When a node is showing the high CMS issue, I/O is actually low, likely because none is going on during CMS GC.  On a node not showing the issue, iostat shows disk usage around 50% (these are SSDs) and load hovers around 10, which is fine for a dual octo-core machine.

In addition, nodetool compactionstats does not show that we are falling behind in compactions.

So not sure what is going on here.  We are running CentOS 6.5 with java 1.7.0_51.  It does seem like things are getting worse and I’m considering dropping and rebuilding all the tables (as I have the data in Hadoop).  This seems to be a repeated problem for us with Cassandra and now that Aerospike has an open source version, we are very much considering switching.

Thanks again for the help and any insight you might have!


avg-cpu:  %user   %nice %system %iowait  %steal   %idle

          23.43   12.40   11.40    6.20    0.00   46.57


Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util

sda               0.00     0.00  720.50    0.00     5.51     0.00    15.66     1.05    1.54   0.57  41.35

sdc               0.00     0.00 5930.50 2100.50    55.42     8.21    16.23     2.30    0.29   0.06  51.25

sdb               0.00     0.00 6951.50 2052.50    65.82     8.02    16.80     4.31    0.48   0.07  59.60


avg-cpu:  %user   %nice %system %iowait  %steal   %idle

           9.48   14.72    5.60    3.67    0.00   66.52


Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util

sda               0.00     0.00  327.00   19.50     2.55     0.08    15.53     0.04    0.13   0.12   4.25

sdc               0.00     0.00 3974.50 1403.50    36.37     5.48    15.94     0.99    0.18   0.08  45.10

sdb               0.00     0.00 4357.50 1535.00    40.83     6.00    16.28     1.10    0.19   0.08  47.45

From: DuyHai Doan <do...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Thursday, July 24, 2014 at 4:44 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Hot, large row

For global_user_event_skus_v2

1. The number of SSTables per read is quite huge. Considering you're using LCS, it means that LCS cannot keep up with the write rate and is falling behind. AFAIK LCS uses SizeTieredCompaction at L0 to cope with extreme write bursts. Your high number of SSTables per read is quite fishy here.

2. Write latency is spread out, up to 73.457 millisecs, meaning that your node is falling behind on writes in some cases. Most writes are still below 1 millisec, but we don't care about those. What we care about here is the long tail of write latency climbing up to 73 millisecs.

3. Same remarks for read latency, which is worse because the distribution is even "flatter", with worst cases going up to 100 ms.

If I were you, I'd check disk I/O first and maybe CPU usage.


On Thu, Jul 24, 2014 at 10:32 PM, Keith Wright <kw...@nanigans.com> wrote:
Cfhistograms for the tables I believe are most likely the issue are below on the node that most recently presented the issue.  Any ideas?  Note that these tables are LCS and have droppable tombstone ratios of 27% for global_user_event_skus_v2 and 2.7% for the other.  Table definitions also below, you can see that they do use compound keys.

Thanks


CREATE TABLE global_user_event_skus_v2 (

  user_id timeuuid,

  app_id int,

  type text,

  name text,

  sku text,

  creationtime timestamp,

  PRIMARY KEY (user_id, app_id, type, name, sku)

) WITH

  bloom_filter_fp_chance=0.100000 AND

  caching='KEYS_ONLY' AND

  comment='' AND

  dclocal_read_repair_chance=0.000000 AND

  gc_grace_seconds=86400 AND

  index_interval=128 AND

  read_repair_chance=0.100000 AND

  replicate_on_write='true' AND

  populate_io_cache_on_flush='false' AND

  default_time_to_live=2592000 AND

  speculative_retry='99.0PERCENTILE' AND

  memtable_flush_period_in_ms=0 AND

  compaction={'class': 'LeveledCompactionStrategy'} AND

  compression={'sstable_compression': 'LZ4Compressor'};



users/global_user_event_skus_v2 histograms


SSTables per Read

 1 sstables: 1398122

 2 sstables: 231100

 3 sstables: 97334

 4 sstables: 56496

 5 sstables: 43046

 6 sstables: 35471

 7 sstables: 26888

 8 sstables: 21102

10 sstables: 29522

12 sstables: 18314

14 sstables: 11368

17 sstables: 10283

20 sstables: 5766

24 sstables: 3554

29 sstables: 1681

35 sstables: 521

42 sstables: 102

50 sstables: 1


Write Latency (microseconds)

     2 us: 532

     3 us: 10728

     4 us: 49608

     5 us: 173718

     6 us: 404131

     7 us: 697468

     8 us: 981048

    10 us: 2643680

    12 us: 3183727

    14 us: 3150881

    17 us: 3746333

    20 us: 2298481

    24 us: 1647604

    29 us: 1070733

    35 us: 693300

    42 us: 442120

    50 us: 288618

    60 us: 205061

    72 us: 141194

    86 us: 98781

   103 us: 73629

   124 us: 54900

   149 us: 40518

   179 us: 30330

   215 us: 23953

   258 us: 18875

   310 us: 14147

   372 us: 10509

   446 us: 7984

   535 us: 6257

   642 us: 4903

   770 us: 4086

   924 us: 3252

  1109 us: 2676

  1331 us: 2129

  1597 us: 1753

  1916 us: 1488

  2299 us: 1237

  2759 us: 1078

  3311 us: 866

  3973 us: 704

  4768 us: 551

  5722 us: 553

  6866 us: 774

  8239 us: 443

  9887 us: 303

 11864 us: 222

 14237 us: 181

 17084 us: 146

 20501 us: 147

 24601 us: 499

 29521 us: 842

 35425 us: 548

 42510 us: 303

 51012 us: 560

 61214 us: 842

 73457 us: 600

 88148 us: 283

105778 us: 173

126934 us: 81

152321 us: 18

182785 us: 2


Read Latency (microseconds)

     35 us: 2

     42 us: 72

     50 us: 737

     60 us: 3555

     72 us: 6926

     86 us: 8373

    103 us: 15287

    124 us: 30054

    149 us: 46164

    179 us: 67828

    215 us: 86741

    258 us: 92178

    310 us: 82205

    372 us: 69641

    446 us: 64710

    535 us: 58295

    642 us: 55247

    770 us: 57918

    924 us: 56928

   1109 us: 75762

   1331 us: 122306

   1597 us: 141564

   1916 us: 117301

   2299 us: 102293

   2759 us: 98789

   3311 us: 86923

   3973 us: 75304

   4768 us: 64012

   5722 us: 54973

   6866 us: 47221

   8239 us: 41483

   9887 us: 35015

  11864 us: 30249

  14237 us: 23505

  17084 us: 18364

  20501 us: 12992

  24601 us: 8837

  29521 us: 6497

  35425 us: 4840

  42510 us: 3536

  51012 us: 2471

  61214 us: 2784

  73457 us: 3664

  88148 us: 3219

 105778 us: 1980

 126934 us: 1050

 152321 us: 519

 182785 us: 184

 219342 us: 82

 263210 us: 39

 315852 us: 22

 379022 us: 16

 454826 us: 0

 545791 us: 0

 654949 us: 1

 785939 us: 1

 943127 us: 0

1131752 us: 0

1358102 us: 0

1629722 us: 3

1955666 us: 3

2346799 us: 1

2816159 us: 0

3379391 us: 0

4055269 us: 0

4866323 us: 0

5839588 us: 0

7007506 us: 4


Partition Size (bytes)

     35 bytes: 31

     42 bytes: 0

     50 bytes: 0

     60 bytes: 0

     72 bytes: 0

     86 bytes: 0

    103 bytes: 761108

    124 bytes: 23596164

    149 bytes: 22963275

    179 bytes: 3517971

    215 bytes: 6395335

    258 bytes: 6197982

    310 bytes: 4662589

    372 bytes: 5261934

    446 bytes: 3032032

    535 bytes: 3775310

    642 bytes: 2879935

    770 bytes: 3002743

    924 bytes: 2279900

   1109 bytes: 1976357

   1331 bytes: 1643875

   1597 bytes: 1422199

   1916 bytes: 1142746

   2299 bytes: 901584

   2759 bytes: 760173

   3311 bytes: 596874

   3973 bytes: 480502

   4768 bytes: 380185

   5722 bytes: 303315

   6866 bytes: 239013

   8239 bytes: 187337

   9887 bytes: 147218

  11864 bytes: 115684

  14237 bytes: 89718

  17084 bytes: 69160

  20501 bytes: 53070

  24601 bytes: 40496

  29521 bytes: 30344

  35425 bytes: 22656

  42510 bytes: 16419

  51012 bytes: 11621

  61214 bytes: 8020

  73457 bytes: 5561

  88148 bytes: 3708

 105778 bytes: 2434

 126934 bytes: 1561

 152321 bytes: 1017

 182785 bytes: 559

 219342 bytes: 349

 263210 bytes: 181

 315852 bytes: 121

 379022 bytes: 66

 454826 bytes: 22

 545791 bytes: 14

 654949 bytes: 10

 785939 bytes: 2

 943127 bytes: 0

1131752 bytes: 4

1358102 bytes: 2

1629722 bytes: 1

1955666 bytes: 2

2346799 bytes: 0

2816159 bytes: 0

3379391 bytes: 0

4055269 bytes: 0

4866323 bytes: 1


Cell Count per Partition

    1 cells: 50599458

    2 cells: 13975363

    3 cells: 7927898

    4 cells: 4981142

    5 cells: 3081166

    6 cells: 3034895

    7 cells: 1896720

    8 cells: 1508849

   10 cells: 2418845

   12 cells: 1631289

   14 cells: 1179269

   17 cells: 1301647

   20 cells: 919744

   24 cells: 864313

   29 cells: 735203

   35 cells: 597749

   42 cells: 469611

   50 cells: 367422

   60 cells: 312697

   72 cells: 251782

   86 cells: 197717

  103 cells: 161244

  124 cells: 132074

  149 cells: 103711

  179 cells: 81760

  215 cells: 63360

  258 cells: 48982

  310 cells: 37932

  372 cells: 28328

  446 cells: 21347

  535 cells: 15373

  642 cells: 10959

  770 cells: 7643

  924 cells: 5346

 1109 cells: 3613

 1331 cells: 2291

 1597 cells: 1558

 1916 cells: 885

 2299 cells: 577

 2759 cells: 327

 3311 cells: 169

 3973 cells: 116

 4768 cells: 57

 5722 cells: 23

 6866 cells: 15

 8239 cells: 9

 9887 cells: 3

11864 cells: 0

14237 cells: 5

17084 cells: 1

20501 cells: 0

24601 cells: 2

29521 cells: 0

35425 cells: 0

42510 cells: 0

51012 cells: 0

61214 cells: 1


CREATE TABLE global_user_events (

  user_id timeuuid,

  app_id int,

  type text,

  name text,

  extra_param map<text, text>,

  paid boolean,

  value float,

  PRIMARY KEY (user_id, app_id, type, name)

) WITH

  bloom_filter_fp_chance=0.100000 AND

  caching='KEYS_ONLY' AND

  comment='' AND

  dclocal_read_repair_chance=0.000000 AND

  gc_grace_seconds=86400 AND

  index_interval=128 AND

  read_repair_chance=0.100000 AND

  replicate_on_write='true' AND

  populate_io_cache_on_flush='false' AND

  default_time_to_live=0 AND

  speculative_retry='99.0PERCENTILE' AND

  memtable_flush_period_in_ms=0 AND

  compaction={'class': 'LeveledCompactionStrategy'} AND

  compression={'sstable_compression': 'LZ4Compressor'};


users/global_user_events histograms


SSTables per Read

 1 sstables: 1024446

 2 sstables: 632948

 3 sstables: 230768

 4 sstables: 112428

 5 sstables: 82651

 6 sstables: 61439

 7 sstables: 47518

 8 sstables: 38490

10 sstables: 58054

12 sstables: 41128

14 sstables: 29642

17 sstables: 31056

20 sstables: 20474

24 sstables: 17799

29 sstables: 13744

35 sstables: 9485

42 sstables: 5933

50 sstables: 3607

60 sstables: 1618

72 sstables: 263

86 sstables: 10


Write Latency (microseconds)

     10 us: 10

     12 us: 75

     14 us: 319

     17 us: 3550

     20 us: 20617

     24 us: 105254

     29 us: 311290

     35 us: 494404

     42 us: 516301

     50 us: 534061

     60 us: 421438

     72 us: 314377

     86 us: 375240

    103 us: 333402

    124 us: 816891

    149 us: 667832

    179 us: 430659

    215 us: 336063

    258 us: 380606

    310 us: 235592

    372 us: 68679

    446 us: 14988

    535 us: 6169

    642 us: 3763

    770 us: 2591

    924 us: 2270

   1109 us: 1985

   1331 us: 1744

   1597 us: 1564

   1916 us: 1489

   2299 us: 1326

   2759 us: 1114

   3311 us: 1024

   3973 us: 984

   4768 us: 833

   5722 us: 962

   6866 us: 1076

   8239 us: 891

   9887 us: 611

  11864 us: 489

  14237 us: 324

  17084 us: 306

  20501 us: 222

  24601 us: 426

  29521 us: 791

  35425 us: 820

  42510 us: 534

  51012 us: 530

  61214 us: 946

  73457 us: 1122

  88148 us: 714

 105778 us: 313

 126934 us: 127

 152321 us: 38

 182785 us: 3

 219342 us: 2

 263210 us: 0

 315852 us: 0

 379022 us: 0

 454826 us: 2

 545791 us: 0

 654949 us: 3

 785939 us: 0

 943127 us: 0

1131752 us: 0

1358102 us: 0

1629722 us: 0

1955666 us: 0

2346799 us: 1

2816159 us: 0

3379391 us: 0

4055269 us: 0

4866323 us: 0

5839588 us: 0

7007506 us: 1


Read Latency (microseconds)

     42 us: 1

     50 us: 2

     60 us: 2

     72 us: 4

     86 us: 15

    103 us: 23

    124 us: 29

    149 us: 48

    179 us: 67

    215 us: 80

    258 us: 316

    310 us: 1812

    372 us: 4999

    446 us: 8876

    535 us: 14586

    642 us: 24417

    770 us: 31943

    924 us: 37453

   1109 us: 77449

   1331 us: 139135

   1597 us: 177538

   1916 us: 169568

   2299 us: 177617

   2759 us: 202552

   3311 us: 198101

   3973 us: 179807

   4768 us: 159149

   5722 us: 142282

   6866 us: 126124

   8239 us: 111529

   9887 us: 97658

  11864 us: 84913

  14237 us: 70905

  17084 us: 54917

  20501 us: 40623

  24601 us: 29545

  29521 us: 21448

  35425 us: 15255

  42510 us: 11222

  51012 us: 8150

  61214 us: 7473

  73457 us: 10060

  88148 us: 10530

 105778 us: 7573

 126934 us: 4331

 152321 us: 2097

 182785 us: 836

 219342 us: 255

 263210 us: 84

 315852 us: 28

 379022 us: 8

 454826 us: 2

 545791 us: 0

 654949 us: 2

 785939 us: 1

 943127 us: 0

1131752 us: 0

1358102 us: 0

1629722 us: 2

1955666 us: 6

2346799 us: 18

2816159 us: 0

3379391 us: 0

4055269 us: 0

4866323 us: 0

5839588 us: 0

7007506 us: 34


Partition Size (bytes)

    35 bytes: 25

    42 bytes: 0

    50 bytes: 0

    60 bytes: 0

    72 bytes: 21761

    86 bytes: 694466658

   103 bytes: 219007195

   124 bytes: 7034596

   149 bytes: 110269079

   179 bytes: 101276811

   215 bytes: 38518027

   258 bytes: 103242356

   310 bytes: 51434914

   372 bytes: 38048245

   446 bytes: 42558831

   535 bytes: 32869176

   642 bytes: 17188026

   770 bytes: 14750559

   924 bytes: 15963256

  1109 bytes: 16125324

  1331 bytes: 11462272

  1597 bytes: 6810452

  1916 bytes: 5685739

  2299 bytes: 10232324

  2759 bytes: 5138522

  3311 bytes: 5847998

  3973 bytes: 4065752

  4768 bytes: 3468263

  5722 bytes: 2545431

  6866 bytes: 1701913

  8239 bytes: 993808

  9887 bytes: 495282

 11864 bytes: 208644

 14237 bytes: 74307

 17084 bytes: 22286

 20501 bytes: 5474

 24601 bytes: 1225

 29521 bytes: 210

 35425 bytes: 97

 42510 bytes: 48

 51012 bytes: 21

 61214 bytes: 15

 73457 bytes: 10

 88148 bytes: 3

105778 bytes: 4

126934 bytes: 1

152321 bytes: 2

182785 bytes: 1


Cell Count per Partition

   1 cells: 914323128

   2 cells: 196020828

   3 cells: 165502164

   4 cells: 71735648

   5 cells: 54450466

   6 cells: 28995756

   7 cells: 9625976

   8 cells: 10257788

  10 cells: 24004757

  12 cells: 13087147

  14 cells: 13588347

  17 cells: 11976353

  20 cells: 6245374

  24 cells: 5627821

  29 cells: 9480533

  35 cells: 6008276

  42 cells: 5221357

  50 cells: 4355454

  60 cells: 3955488

  72 cells: 2872578

  86 cells: 1914447

 103 cells: 1216276

 124 cells: 652035

 149 cells: 279565

 179 cells: 99836

 215 cells: 29132

 258 cells: 6682

 310 cells: 1276

 372 cells: 258

 446 cells: 96

 535 cells: 48

 642 cells: 25

 770 cells: 9

 924 cells: 9

1109 cells: 4

1331 cells: 3

1597 cells: 1

1916 cells: 2

From: Jack Krupansky <ja...@basetechnology.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Thursday, July 24, 2014 at 4:07 PM

To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Hot, large row

Could it be some “fat columns” (cells with large blob or text values) rather than the cell-count per se? IOW, a “big row” rather than a “wide row”?

And, could it be a large partition rather than a large row (many rows in a single partition)? Are clustering columns being used in the primary key?

-- Jack Krupansky

From: DuyHai Doan<ma...@gmail.com>
Sent: Thursday, July 24, 2014 3:53 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: Hot, large row

Your extract of cfhistograms show that there are no particular "wide rows". The widest has 61214 cells which is big but not that huge to be really a concern.

Turning on trace probabilty only tells give you some "hints" about what kind of queries are done, it does not give the exact partition key nor other statement values, especially when you are using prepared statements ...


"I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml so that it reduces the impact when this occurs." --> Don't do that, you'll only sweep dust under the carpet. Find the real issue and fix it instead of changing parameter to hide it.

One solution would be on client side, to activate some logging to show the CQL3 statements the application is issuing that may overload the server.  I know that's better said than done but I don't have any other idea for the moment

-------- Shameless self-promotion ------

To support this kind of live prod debugging & investigation that I add a new dynamic query logging feature in Achilles: https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging

Once you hit the issue, this kind of feature may save your day...






On Thu, Jul 24, 2014 at 9:22 PM, Keith Wright <kw...@nanigans.com>> wrote:
I can see from cfhistograms that I do have some wide rows (see below).  I set trace probability as you suggested but the output doesn’t appear to tell me what row was actually read unless I missed something.  I just see executing prepared statement.   Any ideas how I can find the row in question?

I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml so that it reduces the impact when this occurs.

Any help in identifying my issue would be GREATLY appreciated


Cell Count per Partition

    1 cells: 50449950

    2 cells: 14281828

    3 cells: 8093366

    4 cells: 5029200

    5 cells: 3103023

    6 cells: 3059903

    7 cells: 1903018

    8 cells: 1509297

   10 cells: 2420359

   12 cells: 1624895

   14 cells: 1171678

   17 cells: 1289391

   20 cells: 909777

   24 cells: 852081

   29 cells: 722925

   35 cells: 587067

   42 cells: 459473

   50 cells: 358744

   60 cells: 304146

   72 cells: 244682

   86 cells: 191045

  103 cells: 155337

  124 cells: 127061

  149 cells: 98913

  179 cells: 77454

  215 cells: 59849

  258 cells: 46117

  310 cells: 35321

  372 cells: 26319

  446 cells: 19379

  535 cells: 13783

  642 cells: 9993

  770 cells: 6973

  924 cells: 4713

1109 cells: 3229

1331 cells: 2062

1597 cells: 1338

1916 cells: 773

2299 cells: 495

2759 cells: 268

3311 cells: 150

3973 cells: 100

4768 cells: 42

5722 cells: 24

6866 cells: 12

8239 cells: 9

9887 cells: 3

11864 cells: 0

14237 cells: 5

17084 cells: 1

20501 cells: 0

24601 cells: 2

29521 cells: 0

35425 cells: 0

42510 cells: 0

51012 cells: 0

61214 cells: 2


From: DuyHai Doan <do...@gmail.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Thursday, July 24, 2014 at 3:01 PM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Hot, large row

"How can I detect wide rows?" -->

nodetool cfhistograms <keyspace> <suspected column family>

Look at the "Column count" column (the last one) and identify a line with a very high "Offset" value. In a well-designed application you should have a Gaussian distribution where 80% of your rows have a similar number of columns.
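
For example, with the keyspace and table names that appear in the histogram output elsewhere in this thread (just a sketch of the invocation):

nodetool cfhistograms users global_user_event_skus_v2
nodetool cfhistograms users global_user_events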

"Anyone know what debug level I can set so that I can see what reads the hot node is handling?  " -->

"nodetool settraceprobability <value>",  where value is a small number (0.001) on the node where you encounter the issue. Activate the tracing for a while (5 mins) and deactivate it (value = 0). Then look into system_traces tables "events" & "sessions". It may help or not since the tracing is done once every 1000.

"Any way to get the server to blacklist these wide rows automatically?" --> No


On Thu, Jul 24, 2014 at 8:48 PM, Keith Wright <kw...@nanigans.com>> wrote:
Hi all,

   We are seeing an issue where basically daily one of our nodes spikes in load and is churning in CMS heap pressure.  It appears that reads are backing up and my guess is that our application is reading a large row repeatedly.  Our write structure can lead itself to wide rows very infrequently (<0.001%) and we do our best to detect and delete them but obviously we’re missing a case.  Hoping for assistance on the following questions:

 *   How can I detect wide rows?
 *   Anyone know what debug level I can set so that I can see what reads the hot node is handling?  I’m hoping to see the “bad” row
 *   Any way to get the server to blacklist these wide rows automatically?

We’re using C* 2.0.6 with Vnodes.

Thanks




Re: Hot, large row

Posted by DuyHai Doan <do...@gmail.com>.
What are your JVM settings?

Your read pattern implies that you may fetch lots of data into memory
(reading all SKUs for a given user); maybe it stresses the JVM too much.

Did you use the Java Driver's native paging to avoid loading all columns at
once?

And is loading all SKUs for one user a rare scenario, or is it the main use
case for this column family?
 On 25 Jul 2014 at 04:11, "Keith Wright" <kw...@nanigans.com> wrote:

> One last item to add to this thread:  we have consistently experienced
> this behavior where performance degrades over time (previously we were
> unable to bootstrap nodes due to long GC pauses from existing nodes).  I
> believe it's due to tombstone build-up (as I mentioned previously, one of
> the tables mentioned is showing a droppable tombstone ratio of > 30%).
> The sku table is used to hold SKUs that recent users viewed/purchased.
> When we write a new SKU we set the TTL to 30 days where the row key is the
> user id; our read case is to fetch ALL SKUs the user has seen within the
> TTL.  Since the user sees SKUs consistently over time, this can result in
> a row with many columns, many of which are likely tombstoned (see
> CASSANDRA-6654, which I filed for this and which shows that C* does not
> handle this case well).
>
> I guess I’m just surprised that others aren’t using C* for similar usage
> cases and thus having the same issue?
>
>  I am hoping to upgrade to 2.0.9 which has improvements to remove
> tombstones.
>
> From: Keith Wright <kw...@nanigans.com>
> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Date: Thursday, July 24, 2014 at 4:50 PM
> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Cc: Don Jackson <dj...@nanigans.com>
> Subject: Re: Hot, large row
>
> When a node is showing the high CMS issue, I/O is actually low, likely
> because none is going on during CMS GC.  On a node not showing the issue,
> iostat shows disk usage around 50% (these are SSDs) and load hovers around
> 10, which is fine for a dual octo-core machine.
>
> In addition, nodetool compactionstats does not show that we are falling
> behind in compactions.
>
> So not sure what is going on here.  We are running CentOS 6.5 with java
> 1.7.0_51.  It does seem like things are getting worse and I’m considering
> dropping and rebuilding all the tables (as I have the data in Hadoop).
>  This seems to be a repeated problem for us with Cassandra and now that
> Aerospike has an open source version, we are very much considering
> switching.
>
> Thanks again for the help and any insight you might have!
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>
>           23.43   12.40   11.40    6.20    0.00   46.57
>
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz
> avgqu-sz   await  svctm  %util
>
> sda               0.00     0.00  720.50    0.00     5.51     0.00    15.66
>     1.05    1.54   0.57  41.35
>
> sdc               0.00     0.00 5930.50 2100.50    55.42     8.21    16.23
>     2.30    0.29   0.06  51.25
>
> sdb               0.00     0.00 6951.50 2052.50    65.82     8.02    16.80
>     4.31    0.48   0.07  59.60
>
>
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>
>            9.48   14.72    5.60    3.67    0.00   66.52
>
>
> Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz
> avgqu-sz   await  svctm  %util
>
> sda               0.00     0.00  327.00   19.50     2.55     0.08    15.53
>     0.04    0.13   0.12   4.25
>
> sdc               0.00     0.00 3974.50 1403.50    36.37     5.48    15.94
>     0.99    0.18   0.08  45.10
>
> sdb               0.00     0.00 4357.50 1535.00    40.83     6.00    16.28
>     1.10    0.19   0.08  47.45
>
> From: DuyHai Doan <do...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Date: Thursday, July 24, 2014 at 4:44 PM
> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Subject: Re: Hot, large row
>
> For global_user_event_skus_v2
>
> 1. The number of SSTables per read is quite huge. Considering you're using
> LCS, it means that LCS cannot keep up with the write rate and is left
> behind. AFAIK LCS uses SizeTieredCompaction at L0 to cope with extreme
> write bursts. Your high number of SSTables per read is quite fishy here.
>
> 2. Write latency is widespread, reaching up to 73.457 millisecs, meaning
> that your node is getting behind on writes in some cases. Most writes are
> still below 1 millisec, but that is not the concern; what matters is the
> long tail of write latency climbing up to 73 millisecs.
>
> 3. Same remarks for read latency, which is worse because the distribution
> is even "flatter", with worst cases going up to 100 ms.
>
> If I were you, I'd check disk I/O first and maybe CPU usage.
>
>
> On Thu, Jul 24, 2014 at 10:32 PM, Keith Wright <kw...@nanigans.com>
> wrote:
>
>> Cfhistograms for the tables I believe are most likely the issue are below
>> on the node that most recently presented the issue.  Any ideas?  Note that
>> these tables are LCS and have droppable tombstone ratios of 27% for
>> global_user_event_skus_v2 and 2.7% for the other.  Table definitions are
>> also below; you can see that they do use compound keys.
>>
>> Thanks
>>
>> CREATE TABLE global_user_event_skus_v2 (
>>
>>   user_id timeuuid,
>>
>>   app_id int,
>>
>>   type text,
>>
>>   name text,
>>
>>   sku text,
>>
>>   creationtime timestamp,
>>
>>   PRIMARY KEY (user_id, app_id, type, name, sku)
>>
>> ) WITH
>>
>>   bloom_filter_fp_chance=0.100000 AND
>>
>>   caching='KEYS_ONLY' AND
>>
>>   comment='' AND
>>
>>   dclocal_read_repair_chance=0.000000 AND
>>
>>   gc_grace_seconds=86400 AND
>>
>>   index_interval=128 AND
>>
>>   read_repair_chance=0.100000 AND
>>
>>   replicate_on_write='true' AND
>>
>>   populate_io_cache_on_flush='false' AND
>>
>>   default_time_to_live=2592000 AND
>>
>>   speculative_retry='99.0PERCENTILE' AND
>>
>>   memtable_flush_period_in_ms=0 AND
>>
>>   compaction={'class': 'LeveledCompactionStrategy'} AND
>>
>>   compression={'sstable_compression': 'LZ4Compressor'};
>>
>>
>> users/global_user_event_skus_v2 histograms
>>
>>
>> SSTables per Read
>>
>>  1 sstables: 1398122
>>
>>  2 sstables: 231100
>>
>>  3 sstables: 97334
>>
>>  4 sstables: 56496
>>
>>  5 sstables: 43046
>>
>>  6 sstables: 35471
>>
>>  7 sstables: 26888
>>
>>  8 sstables: 21102
>>
>> 10 sstables: 29522
>>
>> 12 sstables: 18314
>>
>> 14 sstables: 11368
>>
>> 17 sstables: 10283
>>
>> 20 sstables: 5766
>>
>> 24 sstables: 3554
>>
>> 29 sstables: 1681
>>
>> 35 sstables: 521
>>
>> 42 sstables: 102
>>
>> 50 sstables: 1
>>
>>
>> Write Latency (microseconds)
>>
>>      2 us: 532
>>
>>      3 us: 10728
>>
>>      4 us: 49608
>>
>>      5 us: 173718
>>
>>      6 us: 404131
>>
>>      7 us: 697468
>>
>>      8 us: 981048
>>
>>     10 us: 2643680
>>
>>     12 us: 3183727
>>
>>     14 us: 3150881
>>
>>     17 us: 3746333
>>
>>     20 us: 2298481
>>
>>     24 us: 1647604
>>
>>     29 us: 1070733
>>
>>     35 us: 693300
>>
>>     42 us: 442120
>>
>>     50 us: 288618
>>
>>     60 us: 205061
>>
>>     72 us: 141194
>>
>>     86 us: 98781
>>
>>    103 us: 73629
>>
>>    124 us: 54900
>>
>>    149 us: 40518
>>
>>    179 us: 30330
>>
>>    215 us: 23953
>>
>>    258 us: 18875
>>
>>    310 us: 14147
>>
>>    372 us: 10509
>>
>>    446 us: 7984
>>
>>    535 us: 6257
>>
>>    642 us: 4903
>>
>>    770 us: 4086
>>
>>    924 us: 3252
>>
>>   1109 us: 2676
>>
>>   1331 us: 2129
>>
>>   1597 us: 1753
>>
>>   1916 us: 1488
>>
>>   2299 us: 1237
>>
>>   2759 us: 1078
>>
>>   3311 us: 866
>>
>>   3973 us: 704
>>
>>   4768 us: 551
>>
>>   5722 us: 553
>>
>>   6866 us: 774
>>
>>   8239 us: 443
>>
>>   9887 us: 303
>>
>>  11864 us: 222
>>
>>  14237 us: 181
>>
>>  17084 us: 146
>>
>>  20501 us: 147
>>
>>  24601 us: 499
>>
>>  29521 us: 842
>>
>>  35425 us: 548
>>
>>  42510 us: 303
>>
>>  51012 us: 560
>>
>>  61214 us: 842
>>
>>  73457 us: 600
>>
>>  88148 us: 283
>>
>> 105778 us: 173
>>
>> 126934 us: 81
>>
>> 152321 us: 18
>>
>> 182785 us: 2
>>
>>
>> Read Latency (microseconds)
>>
>>      35 us: 2
>>
>>      42 us: 72
>>
>>      50 us: 737
>>
>>      60 us: 3555
>>
>>      72 us: 6926
>>
>>      86 us: 8373
>>
>>     103 us: 15287
>>
>>     124 us: 30054
>>
>>     149 us: 46164
>>
>>     179 us: 67828
>>
>>     215 us: 86741
>>
>>     258 us: 92178
>>
>>     310 us: 82205
>>
>>     372 us: 69641
>>
>>     446 us: 64710
>>
>>     535 us: 58295
>>
>>     642 us: 55247
>>
>>     770 us: 57918
>>
>>     924 us: 56928
>>
>>    1109 us: 75762
>>
>>    1331 us: 122306
>>
>>    1597 us: 141564
>>
>>    1916 us: 117301
>>
>>    2299 us: 102293
>>
>>    2759 us: 98789
>>
>>    3311 us: 86923
>>
>>    3973 us: 75304
>>
>>    4768 us: 64012
>>
>>    5722 us: 54973
>>
>>    6866 us: 47221
>>
>>    8239 us: 41483
>>
>>    9887 us: 35015
>>
>>   11864 us: 30249
>>
>>   14237 us: 23505
>>
>>   17084 us: 18364
>>
>>   20501 us: 12992
>>
>>   24601 us: 8837
>>
>>   29521 us: 6497
>>
>>   35425 us: 4840
>>
>>   42510 us: 3536
>>
>>   51012 us: 2471
>>
>>   61214 us: 2784
>>
>>   73457 us: 3664
>>
>>   88148 us: 3219
>>
>>  105778 us: 1980
>>
>>  126934 us: 1050
>>
>>  152321 us: 519
>>
>>  182785 us: 184
>>
>>  219342 us: 82
>>
>>  263210 us: 39
>>
>>  315852 us: 22
>>
>>  379022 us: 16
>>
>>  454826 us: 0
>>
>>  545791 us: 0
>>
>>  654949 us: 1
>>
>>  785939 us: 1
>>
>>  943127 us: 0
>>
>> 1131752 us: 0
>>
>> 1358102 us: 0
>>
>> 1629722 us: 3
>>
>> 1955666 us: 3
>>
>> 2346799 us: 1
>>
>> 2816159 us: 0
>>
>> 3379391 us: 0
>>
>> 4055269 us: 0
>>
>> 4866323 us: 0
>>
>> 5839588 us: 0
>>
>> 7007506 us: 4
>>
>>
>> Partition Size (bytes)
>>
>>      35 bytes: 31
>>
>>      42 bytes: 0
>>
>>      50 bytes: 0
>>
>>      60 bytes: 0
>>
>>      72 bytes: 0
>>
>>      86 bytes: 0
>>
>>     103 bytes: 761108
>>
>>     124 bytes: 23596164
>>
>>     149 bytes: 22963275
>>
>>     179 bytes: 3517971
>>
>>     215 bytes: 6395335
>>
>>     258 bytes: 6197982
>>
>>     310 bytes: 4662589
>>
>>     372 bytes: 5261934
>>
>>     446 bytes: 3032032
>>
>>     535 bytes: 3775310
>>
>>     642 bytes: 2879935
>>
>>     770 bytes: 3002743
>>
>>     924 bytes: 2279900
>>
>>    1109 bytes: 1976357
>>
>>    1331 bytes: 1643875
>>
>>    1597 bytes: 1422199
>>
>>    1916 bytes: 1142746
>>
>>    2299 bytes: 901584
>>
>>    2759 bytes: 760173
>>
>>    3311 bytes: 596874
>>
>>    3973 bytes: 480502
>>
>>    4768 bytes: 380185
>>
>>    5722 bytes: 303315
>>
>>    6866 bytes: 239013
>>
>>    8239 bytes: 187337
>>
>>    9887 bytes: 147218
>>
>>   11864 bytes: 115684
>>
>>   14237 bytes: 89718
>>
>>   17084 bytes: 69160
>>
>>   20501 bytes: 53070
>>
>>   24601 bytes: 40496
>>
>>   29521 bytes: 30344
>>
>>   35425 bytes: 22656
>>
>>   42510 bytes: 16419
>>
>>   51012 bytes: 11621
>>
>>   61214 bytes: 8020
>>
>>   73457 bytes: 5561
>>
>>   88148 bytes: 3708
>>
>>  105778 bytes: 2434
>>
>>  126934 bytes: 1561
>>
>>  152321 bytes: 1017
>>
>>  182785 bytes: 559
>>
>>  219342 bytes: 349
>>
>>  263210 bytes: 181
>>
>>  315852 bytes: 121
>>
>>  379022 bytes: 66
>>
>>  454826 bytes: 22
>>
>>  545791 bytes: 14
>>
>>  654949 bytes: 10
>>
>>  785939 bytes: 2
>>
>>  943127 bytes: 0
>>
>> 1131752 bytes: 4
>>
>> 1358102 bytes: 2
>>
>> 1629722 bytes: 1
>>
>> 1955666 bytes: 2
>>
>> 2346799 bytes: 0
>>
>> 2816159 bytes: 0
>>
>> 3379391 bytes: 0
>>
>> 4055269 bytes: 0
>>
>> 4866323 bytes: 1
>>
>>
>> Cell Count per Partition
>>
>>     1 cells: 50599458
>>
>>     2 cells: 13975363
>>
>>     3 cells: 7927898
>>
>>     4 cells: 4981142
>>
>>     5 cells: 3081166
>>
>>     6 cells: 3034895
>>
>>     7 cells: 1896720
>>
>>     8 cells: 1508849
>>
>>    10 cells: 2418845
>>
>>    12 cells: 1631289
>>
>>    14 cells: 1179269
>>
>>    17 cells: 1301647
>>
>>    20 cells: 919744
>>
>>    24 cells: 864313
>>
>>    29 cells: 735203
>>
>>    35 cells: 597749
>>
>>    42 cells: 469611
>>
>>    50 cells: 367422
>>
>>    60 cells: 312697
>>
>>    72 cells: 251782
>>
>>    86 cells: 197717
>>
>>   103 cells: 161244
>>
>>   124 cells: 132074
>>
>>   149 cells: 103711
>>
>>   179 cells: 81760
>>
>>   215 cells: 63360
>>
>>   258 cells: 48982
>>
>>   310 cells: 37932
>>
>>   372 cells: 28328
>>
>>   446 cells: 21347
>>
>>   535 cells: 15373
>>
>>   642 cells: 10959
>>
>>   770 cells: 7643
>>
>>   924 cells: 5346
>>
>>  1109 cells: 3613
>>
>>  1331 cells: 2291
>>
>>  1597 cells: 1558
>>
>>  1916 cells: 885
>>
>>  2299 cells: 577
>>
>>  2759 cells: 327
>>
>>  3311 cells: 169
>>
>>  3973 cells: 116
>>
>>  4768 cells: 57
>>
>>  5722 cells: 23
>>
>>  6866 cells: 15
>>
>>  8239 cells: 9
>>
>>  9887 cells: 3
>>
>> 11864 cells: 0
>>
>> 14237 cells: 5
>>
>> 17084 cells: 1
>>
>> 20501 cells: 0
>>
>> 24601 cells: 2
>>
>> 29521 cells: 0
>>
>> 35425 cells: 0
>>
>> 42510 cells: 0
>>
>> 51012 cells: 0
>>
>> 61214 cells: 1
>>
>>
>> CREATE TABLE global_user_events (
>>
>>   user_id timeuuid,
>>
>>   app_id int,
>>
>>   type text,
>>
>>   name text,
>>
>>   extra_param map<text, text>,
>>
>>   paid boolean,
>>
>>   value float,
>>
>>   PRIMARY KEY (user_id, app_id, type, name)
>>
>> ) WITH
>>
>>   bloom_filter_fp_chance=0.100000 AND
>>
>>   caching='KEYS_ONLY' AND
>>
>>   comment='' AND
>>
>>   dclocal_read_repair_chance=0.000000 AND
>>
>>   gc_grace_seconds=86400 AND
>>
>>   index_interval=128 AND
>>
>>   read_repair_chance=0.100000 AND
>>
>>   replicate_on_write='true' AND
>>
>>   populate_io_cache_on_flush='false' AND
>>
>>   default_time_to_live=0 AND
>>
>>   speculative_retry='99.0PERCENTILE' AND
>>
>>   memtable_flush_period_in_ms=0 AND
>>
>>   compaction={'class': 'LeveledCompactionStrategy'} AND
>>
>>   compression={'sstable_compression': 'LZ4Compressor'};
>>
>>
>> users/global_user_events histograms
>>
>>
>> SSTables per Read
>>
>>  1 sstables: 1024446
>>
>>  2 sstables: 632948
>>
>>  3 sstables: 230768
>>
>>  4 sstables: 112428
>>
>>  5 sstables: 82651
>>
>>  6 sstables: 61439
>>
>>  7 sstables: 47518
>>
>>  8 sstables: 38490
>>
>> 10 sstables: 58054
>>
>> 12 sstables: 41128
>>
>> 14 sstables: 29642
>>
>> 17 sstables: 31056
>>
>> 20 sstables: 20474
>>
>> 24 sstables: 17799
>>
>> 29 sstables: 13744
>>
>> 35 sstables: 9485
>>
>> 42 sstables: 5933
>>
>> 50 sstables: 3607
>>
>> 60 sstables: 1618
>>
>> 72 sstables: 263
>>
>> 86 sstables: 10
>>
>>
>> Write Latency (microseconds)
>>
>>      10 us: 10
>>
>>      12 us: 75
>>
>>      14 us: 319
>>
>>      17 us: 3550
>>
>>      20 us: 20617
>>
>>      24 us: 105254
>>
>>      29 us: 311290
>>
>>      35 us: 494404
>>
>>      42 us: 516301
>>
>>      50 us: 534061
>>
>>      60 us: 421438
>>
>>      72 us: 314377
>>
>>      86 us: 375240
>>
>>     103 us: 333402
>>
>>     124 us: 816891
>>
>>     149 us: 667832
>>
>>     179 us: 430659
>>
>>     215 us: 336063
>>
>>     258 us: 380606
>>
>>     310 us: 235592
>>
>>     372 us: 68679
>>
>>     446 us: 14988
>>
>>     535 us: 6169
>>
>>     642 us: 3763
>>
>>     770 us: 2591
>>
>>     924 us: 2270
>>
>>    1109 us: 1985
>>
>>    1331 us: 1744
>>
>>    1597 us: 1564
>>
>>    1916 us: 1489
>>
>>    2299 us: 1326
>>
>>    2759 us: 1114
>>
>>    3311 us: 1024
>>
>>    3973 us: 984
>>
>>    4768 us: 833
>>
>>    5722 us: 962
>>
>>    6866 us: 1076
>>
>>    8239 us: 891
>>
>>    9887 us: 611
>>
>>   11864 us: 489
>>
>>   14237 us: 324
>>
>>   17084 us: 306
>>
>>   20501 us: 222
>>
>>   24601 us: 426
>>
>>   29521 us: 791
>>
>>   35425 us: 820
>>
>>   42510 us: 534
>>
>>   51012 us: 530
>>
>>   61214 us: 946
>>
>>   73457 us: 1122
>>
>>   88148 us: 714
>>
>>  105778 us: 313
>>
>>  126934 us: 127
>>
>>  152321 us: 38
>>
>>  182785 us: 3
>>
>>  219342 us: 2
>>
>>  263210 us: 0
>>
>>  315852 us: 0
>>
>>  379022 us: 0
>>
>>  454826 us: 2
>>
>>  545791 us: 0
>>
>>  654949 us: 3
>>
>>  785939 us: 0
>>
>>  943127 us: 0
>>
>> 1131752 us: 0
>>
>> 1358102 us: 0
>>
>> 1629722 us: 0
>>
>> 1955666 us: 0
>>
>> 2346799 us: 1
>>
>> 2816159 us: 0
>>
>> 3379391 us: 0
>>
>> 4055269 us: 0
>>
>> 4866323 us: 0
>>
>> 5839588 us: 0
>>
>> 7007506 us: 1
>>
>>
>> Read Latency (microseconds)
>>
>>      42 us: 1
>>
>>      50 us: 2
>>
>>      60 us: 2
>>
>>      72 us: 4
>>
>>      86 us: 15
>>
>>     103 us: 23
>>
>>     124 us: 29
>>
>>     149 us: 48
>>
>>     179 us: 67
>>
>>     215 us: 80
>>
>>     258 us: 316
>>
>>     310 us: 1812
>>
>>     372 us: 4999
>>
>>     446 us: 8876
>>
>>     535 us: 14586
>>
>>     642 us: 24417
>>
>>     770 us: 31943
>>
>>     924 us: 37453
>>
>>    1109 us: 77449
>>
>>    1331 us: 139135
>>
>>    1597 us: 177538
>>
>>    1916 us: 169568
>>
>>    2299 us: 177617
>>
>>    2759 us: 202552
>>
>>    3311 us: 198101
>>
>>    3973 us: 179807
>>
>>    4768 us: 159149
>>
>>    5722 us: 142282
>>
>>    6866 us: 126124
>>
>>    8239 us: 111529
>>
>>    9887 us: 97658
>>
>>   11864 us: 84913
>>
>>   14237 us: 70905
>>
>>   17084 us: 54917
>>
>>   20501 us: 40623
>>
>>   24601 us: 29545
>>
>>   29521 us: 21448
>>
>>   35425 us: 15255
>>
>>   42510 us: 11222
>>
>>   51012 us: 8150
>>
>>   61214 us: 7473
>>
>>   73457 us: 10060
>>
>>   88148 us: 10530
>>
>>  105778 us: 7573
>>
>>  126934 us: 4331
>>
>>  152321 us: 2097
>>
>>  182785 us: 836
>>
>>  219342 us: 255
>>
>>  263210 us: 84
>>
>>  315852 us: 28
>>
>>  379022 us: 8
>>
>>  454826 us: 2
>>
>>  545791 us: 0
>>
>>  654949 us: 2
>>
>>  785939 us: 1
>>
>>  943127 us: 0
>>
>> 1131752 us: 0
>>
>> 1358102 us: 0
>>
>> 1629722 us: 2
>>
>> 1955666 us: 6
>>
>> 2346799 us: 18
>>
>> 2816159 us: 0
>>
>> 3379391 us: 0
>>
>> 4055269 us: 0
>>
>> 4866323 us: 0
>>
>> 5839588 us: 0
>>
>> 7007506 us: 34
>>
>>
>> Partition Size (bytes)
>>
>>     35 bytes: 25
>>
>>     42 bytes: 0
>>
>>     50 bytes: 0
>>
>>     60 bytes: 0
>>
>>     72 bytes: 21761
>>
>>     86 bytes: 694466658
>>
>>    103 bytes: 219007195
>>
>>    124 bytes: 7034596
>>
>>    149 bytes: 110269079
>>
>>    179 bytes: 101276811
>>
>>    215 bytes: 38518027
>>
>>    258 bytes: 103242356
>>
>>    310 bytes: 51434914
>>
>>    372 bytes: 38048245
>>
>>    446 bytes: 42558831
>>
>>    535 bytes: 32869176
>>
>>    642 bytes: 17188026
>>
>>    770 bytes: 14750559
>>
>>    924 bytes: 15963256
>>
>>   1109 bytes: 16125324
>>
>>   1331 bytes: 11462272
>>
>>   1597 bytes: 6810452
>>
>>   1916 bytes: 5685739
>>
>>   2299 bytes: 10232324
>>
>>   2759 bytes: 5138522
>>
>>   3311 bytes: 5847998
>>
>>   3973 bytes: 4065752
>>
>>   4768 bytes: 3468263
>>
>>   5722 bytes: 2545431
>>
>>   6866 bytes: 1701913
>>
>>   8239 bytes: 993808
>>
>>   9887 bytes: 495282
>>
>>  11864 bytes: 208644
>>
>>  14237 bytes: 74307
>>
>>  17084 bytes: 22286
>>
>>  20501 bytes: 5474
>>
>>  24601 bytes: 1225
>>
>>  29521 bytes: 210
>>
>>  35425 bytes: 97
>>
>>  42510 bytes: 48
>>
>>  51012 bytes: 21
>>
>>  61214 bytes: 15
>>
>>  73457 bytes: 10
>>
>>  88148 bytes: 3
>>
>> 105778 bytes: 4
>>
>> 126934 bytes: 1
>>
>> 152321 bytes: 2
>>
>> 182785 bytes: 1
>>
>>
>> Cell Count per Partition
>>
>>    1 cells: 914323128
>>
>>    2 cells: 196020828
>>
>>    3 cells: 165502164
>>
>>    4 cells: 71735648
>>
>>    5 cells: 54450466
>>
>>    6 cells: 28995756
>>
>>    7 cells: 9625976
>>
>>    8 cells: 10257788
>>
>>   10 cells: 24004757
>>
>>   12 cells: 13087147
>>
>>   14 cells: 13588347
>>
>>   17 cells: 11976353
>>
>>   20 cells: 6245374
>>
>>   24 cells: 5627821
>>
>>   29 cells: 9480533
>>
>>   35 cells: 6008276
>>
>>   42 cells: 5221357
>>
>>   50 cells: 4355454
>>
>>   60 cells: 3955488
>>
>>   72 cells: 2872578
>>
>>   86 cells: 1914447
>>
>>  103 cells: 1216276
>>
>>  124 cells: 652035
>>
>>  149 cells: 279565
>>
>>  179 cells: 99836
>>
>>  215 cells: 29132
>>
>>  258 cells: 6682
>>
>>  310 cells: 1276
>>
>>  372 cells: 258
>>
>>  446 cells: 96
>>
>>  535 cells: 48
>>
>>  642 cells: 25
>>
>>  770 cells: 9
>>
>>  924 cells: 9
>>
>> 1109 cells: 4
>>
>> 1331 cells: 3
>>
>> 1597 cells: 1
>>
>> 1916 cells: 2
>>
>> From: Jack Krupansky <ja...@basetechnology.com>
>> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>> Date: Thursday, July 24, 2014 at 4:07 PM
>>
>> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>> Subject: Re: Hot, large row
>>
>> Could it be some “fat columns” (cells with large blob or text values)
>> rather than the cell-count per se? IOW, a “big row” rather than a “wide
>> row”?
>>
>> And, could it be a large partition rather than a large row (many rows in
>> a single partition)? Are clustering columns being used in the primary key?
>>
>> -- Jack Krupansky
>>
>> *From:* DuyHai Doan <do...@gmail.com>
>> *Sent:* Thursday, July 24, 2014 3:53 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Hot, large row
>>
>> Your extract of cfhistograms show that there are no particular "wide
>> rows". The widest has 61214 cells which is big but not that huge to be
>> really a concern.
>>
>> Turning on trace probabilty only tells give you some "hints" about what
>> kind of queries are done, it does not give the exact partition key nor
>> other statement values, especially when you are using prepared statements
>> ...
>>
>>
>> "I am considering reducing read_request_timeout_in_ms: 5000 in
>> cassandra.yaml so that it reduces the impact when this occurs." --> Don't
>> do that, you'll only sweep dust under the carpet. Find the real issue and
>> fix it instead of changing parameter to hide it.
>>
>> One solution would be on client side, to activate some logging to show
>> the CQL3 statements the application is issuing that may overload the
>> server.  I know that's better said than done but I don't have any other
>> idea for the moment
>>
>> -------- Shameless self-promotion ------
>>
>> To support this kind of live prod debugging & investigation that I add a
>> new dynamic query logging feature in Achilles:
>> https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging
>>
>> Once you hit the issue, this kind of feature may save your day...
>>
>>
>>
>>
>>
>>
>> On Thu, Jul 24, 2014 at 9:22 PM, Keith Wright <kw...@nanigans.com>
>> wrote:
>>
>>> I can see from cfhistograms that I do have some wide rows (see below).
>>> I set trace probability as you suggested but the output doesn’t appear to
>>> tell me what row was actually read unless I missed something.  I just see
>>> executing prepared statement.   Any ideas how I can find the row in
>>> question?
>>>
>>> I am considering reducing read_request_timeout_in_ms: 5000 in
>>> cassandra.yaml so that it reduces the impact when this occurs.
>>>
>>> Any help in identifying my issue would be GREATLY appreciated
>>>
>>>
>>> Cell Count per Partition
>>>
>>>     1 cells: 50449950
>>>
>>>     2 cells: 14281828
>>>
>>>     3 cells: 8093366
>>>
>>>     4 cells: 5029200
>>>
>>>     5 cells: 3103023
>>>
>>>     6 cells: 3059903
>>>
>>>     7 cells: 1903018
>>>
>>>     8 cells: 1509297
>>>
>>>    10 cells: 2420359
>>>
>>>    12 cells: 1624895
>>>
>>>    14 cells: 1171678
>>>
>>>    17 cells: 1289391
>>>
>>>    20 cells: 909777
>>>
>>>    24 cells: 852081
>>>
>>>    29 cells: 722925
>>>
>>>    35 cells: 587067
>>>
>>>    42 cells: 459473
>>>
>>>    50 cells: 358744
>>>
>>>    60 cells: 304146
>>>
>>>    72 cells: 244682
>>>
>>>    86 cells: 191045
>>>
>>>   103 cells: 155337
>>>
>>>   124 cells: 127061
>>>
>>>   149 cells: 98913
>>>
>>>   179 cells: 77454
>>>
>>>   215 cells: 59849
>>>
>>>   258 cells: 46117
>>>
>>>   310 cells: 35321
>>>
>>>   372 cells: 26319
>>>
>>>   446 cells: 19379
>>>
>>>   535 cells: 13783
>>>
>>>   642 cells: 9993
>>>
>>>   770 cells: 6973
>>>
>>>   924 cells: 4713
>>>
>>> 1109 cells: 3229
>>>
>>> 1331 cells: 2062
>>>
>>> 1597 cells: 1338
>>>
>>> 1916 cells: 773
>>>
>>> 2299 cells: 495
>>>
>>> 2759 cells: 268
>>>
>>> 3311 cells: 150
>>>
>>> 3973 cells: 100
>>>
>>> 4768 cells: 42
>>>
>>> 5722 cells: 24
>>>
>>> 6866 cells: 12
>>>
>>> 8239 cells: 9
>>>
>>> 9887 cells: 3
>>>
>>> 11864 cells: 0
>>>
>>> 14237 cells: 5
>>>
>>> 17084 cells: 1
>>>
>>> 20501 cells: 0
>>>
>>> 24601 cells: 2
>>>
>>> 29521 cells: 0
>>>
>>> 35425 cells: 0
>>>
>>> 42510 cells: 0
>>>
>>> 51012 cells: 0
>>>
>>> 61214 cells: 2
>>>
>>> From: DuyHai Doan <do...@gmail.com>
>>> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>>> Date: Thursday, July 24, 2014 at 3:01 PM
>>> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>>> Subject: Re: Hot, large row
>>>
>>> "How can I detect wide rows?" -->
>>>
>>> nodetool cfhistograms <keyspace> <suspected column family>
>>>
>>> Look at the "Column count" column (the last one) and identify a line with
>>> a very high "Offset" value. In a well-designed application you should have
>>> a Gaussian distribution where 80% of your rows have a similar number of
>>> columns.
>>>
>>> "Anyone know what debug level I can set so that I can see what reads the
>>> hot node is handling?  " -->
>>>
>>> "nodetool settraceprobability <value>",  where value is a small number
>>> (0.001) on the node where you encounter the issue. Activate the tracing for
>>> a while (5 mins) and deactivate it (value = 0). Then look into
>>> system_traces tables "events" & "sessions". It may help or not since the
>>> tracing is done once every 1000.
>>>
>>> "Any way to get the server to blacklist these wide rows automatically?"
>>> --> No
>>>
>>>
>>> On Thu, Jul 24, 2014 at 8:48 PM, Keith Wright <kw...@nanigans.com>
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>>    We are seeing an issue where basically daily one of our nodes spikes
>>>> in load and is churning in CMS heap pressure.  It appears that reads are
>>>> backing up and my guess is that our application is reading a large row
>>>> repeatedly.  Our write structure can lead itself to wide rows very
>>>> infrequently (<0.001%) and we do our best to detect and delete them but
>>>> obviously we’re missing a case.  Hoping for assistance on the following
>>>> questions:
>>>>
>>>>    - How can I detect wide rows?
>>>>    - Anyone know what debug level I can set so that I can see what
>>>>    reads the hot node is handling?  I’m hoping to see the “bad” row
>>>>    - Any way to get the server to blacklist these wide rows
>>>>    automatically?
>>>>
>>>> We’re using C* 2.0.6 with Vnodes.
>>>>
>>>> Thanks
>>>>
>>>
>>>
>>
>>
>
>

Re: Hot, large row

Posted by Keith Wright <kw...@nanigans.com>.
One last item to add to this thread:  we have consistently experienced this behavior where performance degrades over time (previously we were unable to bootstrap nodes due to long GC pauses from existing nodes).  I believe it's due to tombstone build-up (as I mentioned previously, one of the tables mentioned is showing a droppable tombstone ratio of > 30%).   The sku table is used to hold SKUs that recent users viewed/purchased.  When we write a new SKU we set the TTL to 30 days where the row key is the user id; our read case is to fetch ALL SKUs the user has seen within the TTL.  Since the user sees SKUs consistently over time, this can result in a row with many columns, many of which are likely tombstoned (see CASSANDRA-6654, which I filed for this and which shows that C* does not handle this case well).

I guess I’m just surprised that others aren’t using C* for similar usage cases and thus having the same issue?

 I am hoping to upgrade to 2.0.9 which has improvements to remove tombstones.
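
To make that access pattern concrete, here is a sketch of the write and read described above, using the global_user_event_skus_v2 definition quoted later in this message (the user id, app id and SKU values are placeholders):

-- write path: each SKU the user views/purchases is written with a 30-day TTL
INSERT INTO users.global_user_event_skus_v2
  (user_id, app_id, type, name, sku, creationtime)
VALUES (62c36092-82a1-11e3-a1b5-425861b86ab6, 42, 'view', 'sku_viewed', 'SKU-12345', '2014-07-24 20:00:00')
USING TTL 2592000;

-- read path: fetch ALL SKUs for the user; expired cells in the partition
-- still have to be read past until compaction purges them after gc_grace
SELECT sku, creationtime
FROM users.global_user_event_skus_v2
WHERE user_id = 62c36092-82a1-11e3-a1b5-425861b86ab6;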

From: Keith Wright <kw...@nanigans.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Thursday, July 24, 2014 at 4:50 PM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Cc: Don Jackson <dj...@nanigans.com>>
Subject: Re: Hot, large row

When a node is showing the high CMS issue, I/O is actually low, likely because none is going on during CMS GC.  On a node not showing the issue, iostat shows disk usage around 50% (these are SSDs) and load hovers around 10, which is fine for a dual octo-core machine.

In addition, nodetool compactionstats does not show that we are falling behind in compactions.

So not sure what is going on here.  We are running CentOS 6.5 with java 1.7.0_51.  It does seem like things are getting worse and I’m considering dropping and rebuilding all the tables (as I have the data in Hadoop).  This seems to be a repeated problem for us with Cassandra and now that Aerospike has an open source version, we are very much considering switching.

Thanks again for the help and any insight you might have!


avg-cpu:  %user   %nice %system %iowait  %steal   %idle

          23.43   12.40   11.40    6.20    0.00   46.57


Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util

sda               0.00     0.00  720.50    0.00     5.51     0.00    15.66     1.05    1.54   0.57  41.35

sdc               0.00     0.00 5930.50 2100.50    55.42     8.21    16.23     2.30    0.29   0.06  51.25

sdb               0.00     0.00 6951.50 2052.50    65.82     8.02    16.80     4.31    0.48   0.07  59.60


avg-cpu:  %user   %nice %system %iowait  %steal   %idle

           9.48   14.72    5.60    3.67    0.00   66.52


Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util

sda               0.00     0.00  327.00   19.50     2.55     0.08    15.53     0.04    0.13   0.12   4.25

sdc               0.00     0.00 3974.50 1403.50    36.37     5.48    15.94     0.99    0.18   0.08  45.10

sdb               0.00     0.00 4357.50 1535.00    40.83     6.00    16.28     1.10    0.19   0.08  47.45

From: DuyHai Doan <do...@gmail.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Thursday, July 24, 2014 at 4:44 PM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Hot, large row

For global_user_event_skus_v2

1. The number of SSTables per read is quite huge. Considering you're using LCS, it means that LCS cannot keep up with the write rate and is left behind. AFAIK LCS uses SizeTieredCompaction at L0 to cope with extreme write bursts. Your high number of SSTables per read is quite fishy here.

2. Write latency is widespread, reaching up to 73.457 millisecs, meaning that your node is getting behind on writes in some cases. Most writes are still below 1 millisec, but that is not the concern; what matters is the long tail of write latency climbing up to 73 millisecs.

3. Same remarks for read latency, which is worse because the distribution is even "flatter", with worst cases going up to 100 ms.

If I were you, I'd check disk I/O first and maybe CPU usage.
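
For reference, a sketch of those checks (the iostat flags are a guess based on the output quoted earlier in this message; the pgrep pattern assumes the standard Cassandra daemon class name):

iostat -x -m 2
top -H -p $(pgrep -f CassandraDaemon)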


On Thu, Jul 24, 2014 at 10:32 PM, Keith Wright <kw...@nanigans.com>> wrote:
Cfhistograms for the tables I believe are most likely the issue are below on the node that most recently presented the issue.  Any ideas?  Note that these tables are LCS and have droppable tombstone ratios of 27% for global_user_event_skus_v2 and 2.7% for the other.  Table definitions are also below; you can see that they do use compound keys.

Thanks


CREATE TABLE global_user_event_skus_v2 (

  user_id timeuuid,

  app_id int,

  type text,

  name text,

  sku text,

  creationtime timestamp,

  PRIMARY KEY (user_id, app_id, type, name, sku)

) WITH

  bloom_filter_fp_chance=0.100000 AND

  caching='KEYS_ONLY' AND

  comment='' AND

  dclocal_read_repair_chance=0.000000 AND

  gc_grace_seconds=86400 AND

  index_interval=128 AND

  read_repair_chance=0.100000 AND

  replicate_on_write='true' AND

  populate_io_cache_on_flush='false' AND

  default_time_to_live=2592000 AND

  speculative_retry='99.0PERCENTILE' AND

  memtable_flush_period_in_ms=0 AND

  compaction={'class': 'LeveledCompactionStrategy'} AND

  compression={'sstable_compression': 'LZ4Compressor'};



users/global_user_event_skus_v2 histograms


SSTables per Read

 1 sstables: 1398122

 2 sstables: 231100

 3 sstables: 97334

 4 sstables: 56496

 5 sstables: 43046

 6 sstables: 35471

 7 sstables: 26888

 8 sstables: 21102

10 sstables: 29522

12 sstables: 18314

14 sstables: 11368

17 sstables: 10283

20 sstables: 5766

24 sstables: 3554

29 sstables: 1681

35 sstables: 521

42 sstables: 102

50 sstables: 1


Write Latency (microseconds)

     2 us: 532

     3 us: 10728

     4 us: 49608

     5 us: 173718

     6 us: 404131

     7 us: 697468

     8 us: 981048

    10 us: 2643680

    12 us: 3183727

    14 us: 3150881

    17 us: 3746333

    20 us: 2298481

    24 us: 1647604

    29 us: 1070733

    35 us: 693300

    42 us: 442120

    50 us: 288618

    60 us: 205061

    72 us: 141194

    86 us: 98781

   103 us: 73629

   124 us: 54900

   149 us: 40518

   179 us: 30330

   215 us: 23953

   258 us: 18875

   310 us: 14147

   372 us: 10509

   446 us: 7984

   535 us: 6257

   642 us: 4903

   770 us: 4086

   924 us: 3252

  1109 us: 2676

  1331 us: 2129

  1597 us: 1753

  1916 us: 1488

  2299 us: 1237

  2759 us: 1078

  3311 us: 866

  3973 us: 704

  4768 us: 551

  5722 us: 553

  6866 us: 774

  8239 us: 443

  9887 us: 303

 11864 us: 222

 14237 us: 181

 17084 us: 146

 20501 us: 147

 24601 us: 499

 29521 us: 842

 35425 us: 548

 42510 us: 303

 51012 us: 560

 61214 us: 842

 73457 us: 600

 88148 us: 283

105778 us: 173

126934 us: 81

152321 us: 18

182785 us: 2


Read Latency (microseconds)

     35 us: 2

     42 us: 72

     50 us: 737

     60 us: 3555

     72 us: 6926

     86 us: 8373

    103 us: 15287

    124 us: 30054

    149 us: 46164

    179 us: 67828

    215 us: 86741

    258 us: 92178

    310 us: 82205

    372 us: 69641

    446 us: 64710

    535 us: 58295

    642 us: 55247

    770 us: 57918

    924 us: 56928

   1109 us: 75762

   1331 us: 122306

   1597 us: 141564

   1916 us: 117301

   2299 us: 102293

   2759 us: 98789

   3311 us: 86923

   3973 us: 75304

   4768 us: 64012

   5722 us: 54973

   6866 us: 47221

   8239 us: 41483

   9887 us: 35015

  11864 us: 30249

  14237 us: 23505

  17084 us: 18364

  20501 us: 12992

  24601 us: 8837

  29521 us: 6497

  35425 us: 4840

  42510 us: 3536

  51012 us: 2471

  61214 us: 2784

  73457 us: 3664

  88148 us: 3219

 105778 us: 1980

 126934 us: 1050

 152321 us: 519

 182785 us: 184

 219342 us: 82

 263210 us: 39

 315852 us: 22

 379022 us: 16

 454826 us: 0

 545791 us: 0

 654949 us: 1

 785939 us: 1

 943127 us: 0

1131752 us: 0

1358102 us: 0

1629722 us: 3

1955666 us: 3

2346799 us: 1

2816159 us: 0

3379391 us: 0

4055269 us: 0

4866323 us: 0

5839588 us: 0

7007506 us: 4


Partition Size (bytes)

     35 bytes: 31

     42 bytes: 0

     50 bytes: 0

     60 bytes: 0

     72 bytes: 0

     86 bytes: 0

    103 bytes: 761108

    124 bytes: 23596164

    149 bytes: 22963275

    179 bytes: 3517971

    215 bytes: 6395335

    258 bytes: 6197982

    310 bytes: 4662589

    372 bytes: 5261934

    446 bytes: 3032032

    535 bytes: 3775310

    642 bytes: 2879935

    770 bytes: 3002743

    924 bytes: 2279900

   1109 bytes: 1976357

   1331 bytes: 1643875

   1597 bytes: 1422199

   1916 bytes: 1142746

   2299 bytes: 901584

   2759 bytes: 760173

   3311 bytes: 596874

   3973 bytes: 480502

   4768 bytes: 380185

   5722 bytes: 303315

   6866 bytes: 239013

   8239 bytes: 187337

   9887 bytes: 147218

  11864 bytes: 115684

  14237 bytes: 89718

  17084 bytes: 69160

  20501 bytes: 53070

  24601 bytes: 40496

  29521 bytes: 30344

  35425 bytes: 22656

  42510 bytes: 16419

  51012 bytes: 11621

  61214 bytes: 8020

  73457 bytes: 5561

  88148 bytes: 3708

 105778 bytes: 2434

 126934 bytes: 1561

 152321 bytes: 1017

 182785 bytes: 559

 219342 bytes: 349

 263210 bytes: 181

 315852 bytes: 121

 379022 bytes: 66

 454826 bytes: 22

 545791 bytes: 14

 654949 bytes: 10

 785939 bytes: 2

 943127 bytes: 0

1131752 bytes: 4

1358102 bytes: 2

1629722 bytes: 1

1955666 bytes: 2

2346799 bytes: 0

2816159 bytes: 0

3379391 bytes: 0

4055269 bytes: 0

4866323 bytes: 1


Cell Count per Partition

    1 cells: 50599458

    2 cells: 13975363

    3 cells: 7927898

    4 cells: 4981142

    5 cells: 3081166

    6 cells: 3034895

    7 cells: 1896720

    8 cells: 1508849

   10 cells: 2418845

   12 cells: 1631289

   14 cells: 1179269

   17 cells: 1301647

   20 cells: 919744

   24 cells: 864313

   29 cells: 735203

   35 cells: 597749

   42 cells: 469611

   50 cells: 367422

   60 cells: 312697

   72 cells: 251782

   86 cells: 197717

  103 cells: 161244

  124 cells: 132074

  149 cells: 103711

  179 cells: 81760

  215 cells: 63360

  258 cells: 48982

  310 cells: 37932

  372 cells: 28328

  446 cells: 21347

  535 cells: 15373

  642 cells: 10959

  770 cells: 7643

  924 cells: 5346

 1109 cells: 3613

 1331 cells: 2291

 1597 cells: 1558

 1916 cells: 885

 2299 cells: 577

 2759 cells: 327

 3311 cells: 169

 3973 cells: 116

 4768 cells: 57

 5722 cells: 23

 6866 cells: 15

 8239 cells: 9

 9887 cells: 3

11864 cells: 0

14237 cells: 5

17084 cells: 1

20501 cells: 0

24601 cells: 2

29521 cells: 0

35425 cells: 0

42510 cells: 0

51012 cells: 0

61214 cells: 1


CREATE TABLE global_user_events (

  user_id timeuuid,

  app_id int,

  type text,

  name text,

  extra_param map<text, text>,

  paid boolean,

  value float,

  PRIMARY KEY (user_id, app_id, type, name)

) WITH

  bloom_filter_fp_chance=0.100000 AND

  caching='KEYS_ONLY' AND

  comment='' AND

  dclocal_read_repair_chance=0.000000 AND

  gc_grace_seconds=86400 AND

  index_interval=128 AND

  read_repair_chance=0.100000 AND

  replicate_on_write='true' AND

  populate_io_cache_on_flush='false' AND

  default_time_to_live=0 AND

  speculative_retry='99.0PERCENTILE' AND

  memtable_flush_period_in_ms=0 AND

  compaction={'class': 'LeveledCompactionStrategy'} AND

  compression={'sstable_compression': 'LZ4Compressor'};


users/global_user_events histograms


SSTables per Read

 1 sstables: 1024446

 2 sstables: 632948

 3 sstables: 230768

 4 sstables: 112428

 5 sstables: 82651

 6 sstables: 61439

 7 sstables: 47518

 8 sstables: 38490

10 sstables: 58054

12 sstables: 41128

14 sstables: 29642

17 sstables: 31056

20 sstables: 20474

24 sstables: 17799

29 sstables: 13744

35 sstables: 9485

42 sstables: 5933

50 sstables: 3607

60 sstables: 1618

72 sstables: 263

86 sstables: 10


Write Latency (microseconds)

     10 us: 10

     12 us: 75

     14 us: 319

     17 us: 3550

     20 us: 20617

     24 us: 105254

     29 us: 311290

     35 us: 494404

     42 us: 516301

     50 us: 534061

     60 us: 421438

     72 us: 314377

     86 us: 375240

    103 us: 333402

    124 us: 816891

    149 us: 667832

    179 us: 430659

    215 us: 336063

    258 us: 380606

    310 us: 235592

    372 us: 68679

    446 us: 14988

    535 us: 6169

    642 us: 3763

    770 us: 2591

    924 us: 2270

   1109 us: 1985

   1331 us: 1744

   1597 us: 1564

   1916 us: 1489

   2299 us: 1326

   2759 us: 1114

   3311 us: 1024

   3973 us: 984

   4768 us: 833

   5722 us: 962

   6866 us: 1076

   8239 us: 891

   9887 us: 611

  11864 us: 489

  14237 us: 324

  17084 us: 306

  20501 us: 222

  24601 us: 426

  29521 us: 791

  35425 us: 820

  42510 us: 534

  51012 us: 530

  61214 us: 946

  73457 us: 1122

  88148 us: 714

 105778 us: 313

 126934 us: 127

 152321 us: 38

 182785 us: 3

 219342 us: 2

 263210 us: 0

 315852 us: 0

 379022 us: 0

 454826 us: 2

 545791 us: 0

 654949 us: 3

 785939 us: 0

 943127 us: 0

1131752 us: 0

1358102 us: 0

1629722 us: 0

1955666 us: 0

2346799 us: 1

2816159 us: 0

3379391 us: 0

4055269 us: 0

4866323 us: 0

5839588 us: 0

7007506 us: 1


Read Latency (microseconds)

     42 us: 1

     50 us: 2

     60 us: 2

     72 us: 4

     86 us: 15

    103 us: 23

    124 us: 29

    149 us: 48

    179 us: 67

    215 us: 80

    258 us: 316

    310 us: 1812

    372 us: 4999

    446 us: 8876

    535 us: 14586

    642 us: 24417

    770 us: 31943

    924 us: 37453

   1109 us: 77449

   1331 us: 139135

   1597 us: 177538

   1916 us: 169568

   2299 us: 177617

   2759 us: 202552

   3311 us: 198101

   3973 us: 179807

   4768 us: 159149

   5722 us: 142282

   6866 us: 126124

   8239 us: 111529

   9887 us: 97658

  11864 us: 84913

  14237 us: 70905

  17084 us: 54917

  20501 us: 40623

  24601 us: 29545

  29521 us: 21448

  35425 us: 15255

  42510 us: 11222

  51012 us: 8150

  61214 us: 7473

  73457 us: 10060

  88148 us: 10530

 105778 us: 7573

 126934 us: 4331

 152321 us: 2097

 182785 us: 836

 219342 us: 255

 263210 us: 84

 315852 us: 28

 379022 us: 8

 454826 us: 2

 545791 us: 0

 654949 us: 2

 785939 us: 1

 943127 us: 0

1131752 us: 0

1358102 us: 0

1629722 us: 2

1955666 us: 6

2346799 us: 18

2816159 us: 0

3379391 us: 0

4055269 us: 0

4866323 us: 0

5839588 us: 0

7007506 us: 34


Partition Size (bytes)

    35 bytes: 25

    42 bytes: 0

    50 bytes: 0

    60 bytes: 0

    72 bytes: 21761

    86 bytes: 694466658

   103 bytes: 219007195

   124 bytes: 7034596

   149 bytes: 110269079

   179 bytes: 101276811

   215 bytes: 38518027

   258 bytes: 103242356

   310 bytes: 51434914

   372 bytes: 38048245

   446 bytes: 42558831

   535 bytes: 32869176

   642 bytes: 17188026

   770 bytes: 14750559

   924 bytes: 15963256

  1109 bytes: 16125324

  1331 bytes: 11462272

  1597 bytes: 6810452

  1916 bytes: 5685739

  2299 bytes: 10232324

  2759 bytes: 5138522

  3311 bytes: 5847998

  3973 bytes: 4065752

  4768 bytes: 3468263

  5722 bytes: 2545431

  6866 bytes: 1701913

  8239 bytes: 993808

  9887 bytes: 495282

 11864 bytes: 208644

 14237 bytes: 74307

 17084 bytes: 22286

 20501 bytes: 5474

 24601 bytes: 1225

 29521 bytes: 210

 35425 bytes: 97

 42510 bytes: 48

 51012 bytes: 21

 61214 bytes: 15

 73457 bytes: 10

 88148 bytes: 3

105778 bytes: 4

126934 bytes: 1

152321 bytes: 2

182785 bytes: 1


Cell Count per Partition

   1 cells: 914323128

   2 cells: 196020828

   3 cells: 165502164

   4 cells: 71735648

   5 cells: 54450466

   6 cells: 28995756

   7 cells: 9625976

   8 cells: 10257788

  10 cells: 24004757

  12 cells: 13087147

  14 cells: 13588347

  17 cells: 11976353

  20 cells: 6245374

  24 cells: 5627821

  29 cells: 9480533

  35 cells: 6008276

  42 cells: 5221357

  50 cells: 4355454

  60 cells: 3955488

  72 cells: 2872578

  86 cells: 1914447

 103 cells: 1216276

 124 cells: 652035

 149 cells: 279565

 179 cells: 99836

 215 cells: 29132

 258 cells: 6682

 310 cells: 1276

 372 cells: 258

 446 cells: 96

 535 cells: 48

 642 cells: 25

 770 cells: 9

 924 cells: 9

1109 cells: 4

1331 cells: 3

1597 cells: 1

1916 cells: 2

From: Jack Krupansky <ja...@basetechnology.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Thursday, July 24, 2014 at 4:07 PM

To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Hot, large row

Could it be some “fat columns” (cells with large blob or text values) rather than the cell-count per se? IOW, a “big row” rather than a “wide row”?

And, could it be a large partition rather than a large row (many rows in a single partition)? Are clustering columns being used in the primary key?

-- Jack Krupansky

From: DuyHai Doan<ma...@gmail.com>
Sent: Thursday, July 24, 2014 3:53 PM
To: user@cassandra.apache.org<ma...@cassandra.apache.org>
Subject: Re: Hot, large row

Your extract of cfhistograms show that there are no particular "wide rows". The widest has 61214 cells which is big but not that huge to be really a concern.

Turning on trace probabilty only tells give you some "hints" about what kind of queries are done, it does not give the exact partition key nor other statement values, especially when you are using prepared statements ...


"I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml so that it reduces the impact when this occurs." --> Don't do that, you'll only sweep dust under the carpet. Find the real issue and fix it instead of changing parameter to hide it.

One solution would be on client side, to activate some logging to show the CQL3 statements the application is issuing that may overload the server.  I know that's better said than done but I don't have any other idea for the moment

-------- Shameless self-promotion ------

To support this kind of live prod debugging & investigation that I add a new dynamic query logging feature in Achilles: https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging

Once you hit the issue, this kind of feature may save your day...






On Thu, Jul 24, 2014 at 9:22 PM, Keith Wright <kw...@nanigans.com> wrote:
I can see from cfhistograms that I do have some wide rows (see below).  I set trace probability as you suggested but the output doesn’t appear to tell me what row was actually read unless I missed something.  I just see executing prepared statement.   Any ideas how I can find the row in question?

I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml so that it reduces the impact when this occurs.

Any help in identifying my issue would be GREATLY appreciated


Cell Count per Partition

    1 cells: 50449950

    2 cells: 14281828

    3 cells: 8093366

    4 cells: 5029200

    5 cells: 3103023

    6 cells: 3059903

    7 cells: 1903018

    8 cells: 1509297

   10 cells: 2420359

   12 cells: 1624895

   14 cells: 1171678

   17 cells: 1289391

   20 cells: 909777

   24 cells: 852081

   29 cells: 722925

   35 cells: 587067

   42 cells: 459473

   50 cells: 358744

   60 cells: 304146

   72 cells: 244682

   86 cells: 191045

  103 cells: 155337

  124 cells: 127061

  149 cells: 98913

  179 cells: 77454

  215 cells: 59849

  258 cells: 46117

  310 cells: 35321

  372 cells: 26319

  446 cells: 19379

  535 cells: 13783

  642 cells: 9993

  770 cells: 6973

  924 cells: 4713

1109 cells: 3229

1331 cells: 2062

1597 cells: 1338

1916 cells: 773

2299 cells: 495

2759 cells: 268

3311 cells: 150

3973 cells: 100

4768 cells: 42

5722 cells: 24

6866 cells: 12

8239 cells: 9

9887 cells: 3

11864 cells: 0

14237 cells: 5

17084 cells: 1

20501 cells: 0

24601 cells: 2

29521 cells: 0

35425 cells: 0

42510 cells: 0

51012 cells: 0

61214 cells: 2


From: DuyHai Doan <do...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Thursday, July 24, 2014 at 3:01 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Hot, large row

"How can I detect wide rows?" -->

nodetool cfhistograms <keyspace> <suspected column family>

Look at column "Column count" (last column) and identify a line in this column with very high value of "Offset". In a well designed application you should have a gaussian distribution where 80% of your row have a similar number of columns.

"Anyone know what debug level I can set so that I can see what reads the hot node is handling?  " -->

"nodetool settraceprobability <value>",  where value is a small number (0.001) on the node where you encounter the issue. Activate the tracing for a while (5 mins) and deactivate it (value = 0). Then look into system_traces tables "events" & "sessions". It may help or not since the tracing is done once every 1000.

"Any way to get the server to blacklist these wide rows automatically?" --> No


On Thu, Jul 24, 2014 at 8:48 PM, Keith Wright <kw...@nanigans.com> wrote:
Hi all,

   We are seeing an issue where basically daily one of our nodes spikes in load and is churning in CMS heap pressure.  It appears that reads are backing up and my guess is that our application is reading a large row repeatedly.  Our write structure can lead itself to wide rows very infrequently (<0.001%) and we do our best to detect and delete them but obviously we’re missing a case.  Hoping for assistance on the following questions:

 *   How can I detect wide rows?
 *   Anyone know what debug level I can set so that I can see what reads the hot node is handling?  I’m hoping to see the “bad” row
 *   Any way to get the server to blacklist these wide rows automatically?

We’re using C* 2.0.6 with Vnodes.

Thanks




Re: Hot, large row

Posted by Keith Wright <kw...@nanigans.com>.
When a node is showing the high CMS issue, IO is actually low, likely because none is going on during CMS GC.  On a node not showing the issue, iostat shows disk usage around 50% (these are SSDs) and load hovers around 10, which is fine for a dual octo-core machine.

In addition, nodetool compactionstats does not show that we are falling behind in compactions.

So I'm not sure what is going on here.  We are running CentOS 6.5 with java 1.7.0_51.  It does seem like things are getting worse, and I'm considering dropping and rebuilding all the tables (as I have the data in Hadoop).  This seems to be a repeated problem for us with Cassandra, and now that Aerospike has an open source version, we are very much considering switching.

Thanks again for the help and any insight you might have!


avg-cpu:  %user   %nice %system %iowait  %steal   %idle

          23.43   12.40   11.40    6.20    0.00   46.57


Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util

sda               0.00     0.00  720.50    0.00     5.51     0.00    15.66     1.05    1.54   0.57  41.35

sdc               0.00     0.00 5930.50 2100.50    55.42     8.21    16.23     2.30    0.29   0.06  51.25

sdb               0.00     0.00 6951.50 2052.50    65.82     8.02    16.80     4.31    0.48   0.07  59.60


avg-cpu:  %user   %nice %system %iowait  %steal   %idle

           9.48   14.72    5.60    3.67    0.00   66.52


Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await  svctm  %util

sda               0.00     0.00  327.00   19.50     2.55     0.08    15.53     0.04    0.13   0.12   4.25

sdc               0.00     0.00 3974.50 1403.50    36.37     5.48    15.94     0.99    0.18   0.08  45.10

sdb               0.00     0.00 4357.50 1535.00    40.83     6.00    16.28     1.10    0.19   0.08  47.45
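A crude way to see whether the CMS pauses line up with the read backlog while a node is hot is a loop like the sketch below (it assumes the packaged log location /var/log/cassandra/system.log, and note GCInspector only logs pauses above its threshold):

while true; do
  date
  nodetool tpstats | egrep 'Pool Name|ReadStage|READ'       # pending/blocked reads and dropped READs
  nodetool compactionstats | head -2                        # compaction backlog
  grep GCInspector /var/log/cassandra/system.log | tail -1  # most recent long GC pause
  sleep 10
done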





Re: Hot, large row

Posted by DuyHai Doan <do...@gmail.com>.
For global_user_event_skus_v2

1. The number of SSTables per read is quite huge. Considering you're using LCS,
it means that LCS cannot keep up with the write rate and is left behind. AFAIK
LCS uses SizeTieredCompaction at L0 to cope with extreme write bursts.
Your high number of SSTables per read is quite fishy here.

2. Write latency is widespread, up to 73.457 millisecs, meaning that your
node is getting behind on writes in some cases. Most writes are still
below 1 millisec, but we don't care about those. What we care about here is the
large trail of write latency climbing up to 73 millisecs.

3. Same remark for read latency, which is worse because the distribution is
even "flatter", with worst cases going up to 100 ms.

If I were you, I'd check disk I/O first and maybe CPU usage.
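A quick way to check point 1 and the disk on the hot node is sketched below (if this nodetool build does not accept the keyspace.table filter, run plain "nodetool cfstats" and search for the table in its output):

# lots of SSTables stuck in L0 on the "SSTables in each level" line means LCS is behind
nodetool cfstats users.global_user_event_skus_v2 | egrep 'SSTable count|SSTables in each level'
nodetool compactionstats

# disk and CPU while the node is hot
iostat -x -m 5 3
top -b -n 1 | head -20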



Re: Hot, large row

Posted by Keith Wright <kw...@nanigans.com>.
Cfhistograms output for the tables I believe are most likely the issue is below, from the node that most recently presented the issue.  Any ideas?  Note that these tables use LCS and have droppable tombstone ratios of 27% for global_user_event_skus_v2 and 2.7% for the other.  Table definitions are also below; you can see that they do use compound keys.

Thanks


CREATE TABLE global_user_event_skus_v2 (

  user_id timeuuid,

  app_id int,

  type text,

  name text,

  sku text,

  creationtime timestamp,

  PRIMARY KEY (user_id, app_id, type, name, sku)

) WITH

  bloom_filter_fp_chance=0.100000 AND

  caching='KEYS_ONLY' AND

  comment='' AND

  dclocal_read_repair_chance=0.000000 AND

  gc_grace_seconds=86400 AND

  index_interval=128 AND

  read_repair_chance=0.100000 AND

  replicate_on_write='true' AND

  populate_io_cache_on_flush='false' AND

  default_time_to_live=2592000 AND

  speculative_retry='99.0PERCENTILE' AND

  memtable_flush_period_in_ms=0 AND

  compaction={'class': 'LeveledCompactionStrategy'} AND

  compression={'sstable_compression': 'LZ4Compressor'};
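A note on this layout: user_id is the partition key and app_id/type/name/sku are clustering columns, so all rows for one user land in a single partition. Once tracing or application logs turn up a suspect user_id, its width can be checked directly from cqlsh; a sketch, with <suspect user_id> as a placeholder and the LIMIT raised because COUNT stops at the default 10000-row limit:

echo "SELECT COUNT(*) FROM users.global_user_event_skus_v2 WHERE user_id = <suspect user_id> LIMIT 1000000;" | cqlsh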



users/global_user_event_skus_v2 histograms


SSTables per Read

 1 sstables: 1398122

 2 sstables: 231100

 3 sstables: 97334

 4 sstables: 56496

 5 sstables: 43046

 6 sstables: 35471

 7 sstables: 26888

 8 sstables: 21102

10 sstables: 29522

12 sstables: 18314

14 sstables: 11368

17 sstables: 10283

20 sstables: 5766

24 sstables: 3554

29 sstables: 1681

35 sstables: 521

42 sstables: 102

50 sstables: 1


Write Latency (microseconds)

     2 us: 532

     3 us: 10728

     4 us: 49608

     5 us: 173718

     6 us: 404131

     7 us: 697468

     8 us: 981048

    10 us: 2643680

    12 us: 3183727

    14 us: 3150881

    17 us: 3746333

    20 us: 2298481

    24 us: 1647604

    29 us: 1070733

    35 us: 693300

    42 us: 442120

    50 us: 288618

    60 us: 205061

    72 us: 141194

    86 us: 98781

   103 us: 73629

   124 us: 54900

   149 us: 40518

   179 us: 30330

   215 us: 23953

   258 us: 18875

   310 us: 14147

   372 us: 10509

   446 us: 7984

   535 us: 6257

   642 us: 4903

   770 us: 4086

   924 us: 3252

  1109 us: 2676

  1331 us: 2129

  1597 us: 1753

  1916 us: 1488

  2299 us: 1237

  2759 us: 1078

  3311 us: 866

  3973 us: 704

  4768 us: 551

  5722 us: 553

  6866 us: 774

  8239 us: 443

  9887 us: 303

 11864 us: 222

 14237 us: 181

 17084 us: 146

 20501 us: 147

 24601 us: 499

 29521 us: 842

 35425 us: 548

 42510 us: 303

 51012 us: 560

 61214 us: 842

 73457 us: 600

 88148 us: 283

105778 us: 173

126934 us: 81

152321 us: 18

182785 us: 2


Read Latency (microseconds)

     35 us: 2

     42 us: 72

     50 us: 737

     60 us: 3555

     72 us: 6926

     86 us: 8373

    103 us: 15287

    124 us: 30054

    149 us: 46164

    179 us: 67828

    215 us: 86741

    258 us: 92178

    310 us: 82205

    372 us: 69641

    446 us: 64710

    535 us: 58295

    642 us: 55247

    770 us: 57918

    924 us: 56928

   1109 us: 75762

   1331 us: 122306

   1597 us: 141564

   1916 us: 117301

   2299 us: 102293

   2759 us: 98789

   3311 us: 86923

   3973 us: 75304

   4768 us: 64012

   5722 us: 54973

   6866 us: 47221

   8239 us: 41483

   9887 us: 35015

  11864 us: 30249

  14237 us: 23505

  17084 us: 18364

  20501 us: 12992

  24601 us: 8837

  29521 us: 6497

  35425 us: 4840

  42510 us: 3536

  51012 us: 2471

  61214 us: 2784

  73457 us: 3664

  88148 us: 3219

 105778 us: 1980

 126934 us: 1050

 152321 us: 519

 182785 us: 184

 219342 us: 82

 263210 us: 39

 315852 us: 22

 379022 us: 16

 454826 us: 0

 545791 us: 0

 654949 us: 1

 785939 us: 1

 943127 us: 0

1131752 us: 0

1358102 us: 0

1629722 us: 3

1955666 us: 3

2346799 us: 1

2816159 us: 0

3379391 us: 0

4055269 us: 0

4866323 us: 0

5839588 us: 0

7007506 us: 4


Partition Size (bytes)

     35 bytes: 31

     42 bytes: 0

     50 bytes: 0

     60 bytes: 0

     72 bytes: 0

     86 bytes: 0

    103 bytes: 761108

    124 bytes: 23596164

    149 bytes: 22963275

    179 bytes: 3517971

    215 bytes: 6395335

    258 bytes: 6197982

    310 bytes: 4662589

    372 bytes: 5261934

    446 bytes: 3032032

    535 bytes: 3775310

    642 bytes: 2879935

    770 bytes: 3002743

    924 bytes: 2279900

   1109 bytes: 1976357

   1331 bytes: 1643875

   1597 bytes: 1422199

   1916 bytes: 1142746

   2299 bytes: 901584

   2759 bytes: 760173

   3311 bytes: 596874

   3973 bytes: 480502

   4768 bytes: 380185

   5722 bytes: 303315

   6866 bytes: 239013

   8239 bytes: 187337

   9887 bytes: 147218

  11864 bytes: 115684

  14237 bytes: 89718

  17084 bytes: 69160

  20501 bytes: 53070

  24601 bytes: 40496

  29521 bytes: 30344

  35425 bytes: 22656

  42510 bytes: 16419

  51012 bytes: 11621

  61214 bytes: 8020

  73457 bytes: 5561

  88148 bytes: 3708

 105778 bytes: 2434

 126934 bytes: 1561

 152321 bytes: 1017

 182785 bytes: 559

 219342 bytes: 349

 263210 bytes: 181

 315852 bytes: 121

 379022 bytes: 66

 454826 bytes: 22

 545791 bytes: 14

 654949 bytes: 10

 785939 bytes: 2

 943127 bytes: 0

1131752 bytes: 4

1358102 bytes: 2

1629722 bytes: 1

1955666 bytes: 2

2346799 bytes: 0

2816159 bytes: 0

3379391 bytes: 0

4055269 bytes: 0

4866323 bytes: 1


Cell Count per Partition

    1 cells: 50599458

    2 cells: 13975363

    3 cells: 7927898

    4 cells: 4981142

    5 cells: 3081166

    6 cells: 3034895

    7 cells: 1896720

    8 cells: 1508849

   10 cells: 2418845

   12 cells: 1631289

   14 cells: 1179269

   17 cells: 1301647

   20 cells: 919744

   24 cells: 864313

   29 cells: 735203

   35 cells: 597749

   42 cells: 469611

   50 cells: 367422

   60 cells: 312697

   72 cells: 251782

   86 cells: 197717

  103 cells: 161244

  124 cells: 132074

  149 cells: 103711

  179 cells: 81760

  215 cells: 63360

  258 cells: 48982

  310 cells: 37932

  372 cells: 28328

  446 cells: 21347

  535 cells: 15373

  642 cells: 10959

  770 cells: 7643

  924 cells: 5346

 1109 cells: 3613

 1331 cells: 2291

 1597 cells: 1558

 1916 cells: 885

 2299 cells: 577

 2759 cells: 327

 3311 cells: 169

 3973 cells: 116

 4768 cells: 57

 5722 cells: 23

 6866 cells: 15

 8239 cells: 9

 9887 cells: 3

11864 cells: 0

14237 cells: 5

17084 cells: 1

20501 cells: 0

24601 cells: 2

29521 cells: 0

35425 cells: 0

42510 cells: 0

51012 cells: 0

61214 cells: 1


CREATE TABLE global_user_events (

  user_id timeuuid,

  app_id int,

  type text,

  name text,

  extra_param map<text, text>,

  paid boolean,

  value float,

  PRIMARY KEY (user_id, app_id, type, name)

) WITH

  bloom_filter_fp_chance=0.100000 AND

  caching='KEYS_ONLY' AND

  comment='' AND

  dclocal_read_repair_chance=0.000000 AND

  gc_grace_seconds=86400 AND

  index_interval=128 AND

  read_repair_chance=0.100000 AND

  replicate_on_write='true' AND

  populate_io_cache_on_flush='false' AND

  default_time_to_live=0 AND

  speculative_retry='99.0PERCENTILE' AND

  memtable_flush_period_in_ms=0 AND

  compaction={'class': 'LeveledCompactionStrategy'} AND

  compression={'sstable_compression': 'LZ4Compressor'};
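One more note, relevant to Jack's "fat columns" question below: every entry of the extra_param map is stored internally as its own cell, so a handful of events carrying large maps can inflate both the cell count and the partition size. A suspect partition can be eyeballed with a query along these lines (<suspect user_id> again a placeholder):

echo "SELECT app_id, type, name, extra_param FROM users.global_user_events WHERE user_id = <suspect user_id>;" | cqlsh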


users/global_user_events histograms


SSTables per Read

 1 sstables: 1024446

 2 sstables: 632948

 3 sstables: 230768

 4 sstables: 112428

 5 sstables: 82651

 6 sstables: 61439

 7 sstables: 47518

 8 sstables: 38490

10 sstables: 58054

12 sstables: 41128

14 sstables: 29642

17 sstables: 31056

20 sstables: 20474

24 sstables: 17799

29 sstables: 13744

35 sstables: 9485

42 sstables: 5933

50 sstables: 3607

60 sstables: 1618

72 sstables: 263

86 sstables: 10


Write Latency (microseconds)

     10 us: 10

     12 us: 75

     14 us: 319

     17 us: 3550

     20 us: 20617

     24 us: 105254

     29 us: 311290

     35 us: 494404

     42 us: 516301

     50 us: 534061

     60 us: 421438

     72 us: 314377

     86 us: 375240

    103 us: 333402

    124 us: 816891

    149 us: 667832

    179 us: 430659

    215 us: 336063

    258 us: 380606

    310 us: 235592

    372 us: 68679

    446 us: 14988

    535 us: 6169

    642 us: 3763

    770 us: 2591

    924 us: 2270

   1109 us: 1985

   1331 us: 1744

   1597 us: 1564

   1916 us: 1489

   2299 us: 1326

   2759 us: 1114

   3311 us: 1024

   3973 us: 984

   4768 us: 833

   5722 us: 962

   6866 us: 1076

   8239 us: 891

   9887 us: 611

  11864 us: 489

  14237 us: 324

  17084 us: 306

  20501 us: 222

  24601 us: 426

  29521 us: 791

  35425 us: 820

  42510 us: 534

  51012 us: 530

  61214 us: 946

  73457 us: 1122

  88148 us: 714

 105778 us: 313

 126934 us: 127

 152321 us: 38

 182785 us: 3

 219342 us: 2

 263210 us: 0

 315852 us: 0

 379022 us: 0

 454826 us: 2

 545791 us: 0

 654949 us: 3

 785939 us: 0

 943127 us: 0

1131752 us: 0

1358102 us: 0

1629722 us: 0

1955666 us: 0

2346799 us: 1

2816159 us: 0

3379391 us: 0

4055269 us: 0

4866323 us: 0

5839588 us: 0

7007506 us: 1


Read Latency (microseconds)

     42 us: 1

     50 us: 2

     60 us: 2

     72 us: 4

     86 us: 15

    103 us: 23

    124 us: 29

    149 us: 48

    179 us: 67

    215 us: 80

    258 us: 316

    310 us: 1812

    372 us: 4999

    446 us: 8876

    535 us: 14586

    642 us: 24417

    770 us: 31943

    924 us: 37453

   1109 us: 77449

   1331 us: 139135

   1597 us: 177538

   1916 us: 169568

   2299 us: 177617

   2759 us: 202552

   3311 us: 198101

   3973 us: 179807

   4768 us: 159149

   5722 us: 142282

   6866 us: 126124

   8239 us: 111529

   9887 us: 97658

  11864 us: 84913

  14237 us: 70905

  17084 us: 54917

  20501 us: 40623

  24601 us: 29545

  29521 us: 21448

  35425 us: 15255

  42510 us: 11222

  51012 us: 8150

  61214 us: 7473

  73457 us: 10060

  88148 us: 10530

 105778 us: 7573

 126934 us: 4331

 152321 us: 2097

 182785 us: 836

 219342 us: 255

 263210 us: 84

 315852 us: 28

 379022 us: 8

 454826 us: 2

 545791 us: 0

 654949 us: 2

 785939 us: 1

 943127 us: 0

1131752 us: 0

1358102 us: 0

1629722 us: 2

1955666 us: 6

2346799 us: 18

2816159 us: 0

3379391 us: 0

4055269 us: 0

4866323 us: 0

5839588 us: 0

7007506 us: 34


Partition Size (bytes)

    35 bytes: 25

    42 bytes: 0

    50 bytes: 0

    60 bytes: 0

    72 bytes: 21761

    86 bytes: 694466658

   103 bytes: 219007195

   124 bytes: 7034596

   149 bytes: 110269079

   179 bytes: 101276811

   215 bytes: 38518027

   258 bytes: 103242356

   310 bytes: 51434914

   372 bytes: 38048245

   446 bytes: 42558831

   535 bytes: 32869176

   642 bytes: 17188026

   770 bytes: 14750559

   924 bytes: 15963256

  1109 bytes: 16125324

  1331 bytes: 11462272

  1597 bytes: 6810452

  1916 bytes: 5685739

  2299 bytes: 10232324

  2759 bytes: 5138522

  3311 bytes: 5847998

  3973 bytes: 4065752

  4768 bytes: 3468263

  5722 bytes: 2545431

  6866 bytes: 1701913

  8239 bytes: 993808

  9887 bytes: 495282

 11864 bytes: 208644

 14237 bytes: 74307

 17084 bytes: 22286

 20501 bytes: 5474

 24601 bytes: 1225

 29521 bytes: 210

 35425 bytes: 97

 42510 bytes: 48

 51012 bytes: 21

 61214 bytes: 15

 73457 bytes: 10

 88148 bytes: 3

105778 bytes: 4

126934 bytes: 1

152321 bytes: 2

182785 bytes: 1


Cell Count per Partition

   1 cells: 914323128

   2 cells: 196020828

   3 cells: 165502164

   4 cells: 71735648

   5 cells: 54450466

   6 cells: 28995756

   7 cells: 9625976

   8 cells: 10257788

  10 cells: 24004757

  12 cells: 13087147

  14 cells: 13588347

  17 cells: 11976353

  20 cells: 6245374

  24 cells: 5627821

  29 cells: 9480533

  35 cells: 6008276

  42 cells: 5221357

  50 cells: 4355454

  60 cells: 3955488

  72 cells: 2872578

  86 cells: 1914447

 103 cells: 1216276

 124 cells: 652035

 149 cells: 279565

 179 cells: 99836

 215 cells: 29132

 258 cells: 6682

 310 cells: 1276

 372 cells: 258

 446 cells: 96

 535 cells: 48

 642 cells: 25

 770 cells: 9

 924 cells: 9

1109 cells: 4

1331 cells: 3

1597 cells: 1

1916 cells: 2

From: Jack Krupansky <ja...@basetechnology.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Thursday, July 24, 2014 at 4:07 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Hot, large row

Could it be some “fat columns” (cells with large blob or text values) rather than the cell-count per se? IOW, a “big row” rather than a “wide row”?

And, could it be a large partition rather than a large row (many rows in a single partition)? Are clustering columns being used in the primary key?

-- Jack Krupansky

From: DuyHai Doan
Sent: Thursday, July 24, 2014 3:53 PM
To: user@cassandra.apache.org
Subject: Re: Hot, large row

Your extract of cfhistograms shows that there are no particular "wide rows". The widest has 61214 cells, which is big but not so huge as to be a real concern.

Turning on trace probability only gives you some "hints" about what kind of queries are being run; it does not give the exact partition key or other statement values, especially when you are using prepared statements ...


"I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml so that it reduces the impact when this occurs." --> Don't do that, you'll only sweep dust under the carpet. Find the real issue and fix it instead of changing a parameter to hide it.

One solution would be, on the client side, to activate some logging to show the CQL3 statements the application is issuing that may overload the server.  I know that's easier said than done, but I don't have any other idea for the moment.

-------- Shameless self-promotion ------

To support this kind of live prod debugging & investigation, I added a new dynamic query logging feature in Achilles: https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging

Once you hit the issue, this kind of feature may save your day...






On Thu, Jul 24, 2014 at 9:22 PM, Keith Wright <kw...@nanigans.com> wrote:
I can see from cfhistograms that I do have some wide rows (see below).  I set trace probability as you suggested but the output doesn’t appear to tell me what row was actually read unless I missed something.  I just see executing prepared statement.   Any ideas how I can find the row in question?

I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml so that it reduces the impact when this occurs.

Any help in identifying my issue would be GREATLY appreciated


Cell Count per Partition

    1 cells: 50449950

    2 cells: 14281828

    3 cells: 8093366

    4 cells: 5029200

    5 cells: 3103023

    6 cells: 3059903

    7 cells: 1903018

    8 cells: 1509297

   10 cells: 2420359

   12 cells: 1624895

   14 cells: 1171678

   17 cells: 1289391

   20 cells: 909777

   24 cells: 852081

   29 cells: 722925

   35 cells: 587067

   42 cells: 459473

   50 cells: 358744

   60 cells: 304146

   72 cells: 244682

   86 cells: 191045

  103 cells: 155337

  124 cells: 127061

  149 cells: 98913

  179 cells: 77454

  215 cells: 59849

  258 cells: 46117

  310 cells: 35321

  372 cells: 26319

  446 cells: 19379

  535 cells: 13783

  642 cells: 9993

  770 cells: 6973

  924 cells: 4713

1109 cells: 3229

1331 cells: 2062

1597 cells: 1338

1916 cells: 773

2299 cells: 495

2759 cells: 268

3311 cells: 150

3973 cells: 100

4768 cells: 42

5722 cells: 24

6866 cells: 12

8239 cells: 9

9887 cells: 3

11864 cells: 0

14237 cells: 5

17084 cells: 1

20501 cells: 0

24601 cells: 2

29521 cells: 0

35425 cells: 0

42510 cells: 0

51012 cells: 0

61214 cells: 2


From: DuyHai Doan <do...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Thursday, July 24, 2014 at 3:01 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Hot, large row

"How can I detect wide rows?" -->

nodetool cfhistograms <keyspace> <suspected column family>

Look at column "Column count" (last column) and identify a line in this column with very high value of "Offset". In a well designed application you should have a gaussian distribution where 80% of your row have a similar number of columns.

"Anyone know what debug level I can set so that I can see what reads the hot node is handling?  " -->

"nodetool settraceprobability <value>",  where value is a small number (0.001) on the node where you encounter the issue. Activate the tracing for a while (5 mins) and deactivate it (value = 0). Then look into system_traces tables "events" & "sessions". It may help or not since the tracing is done once every 1000.

"Any way to get the server to blacklist these wide rows automatically?" --> No


On Thu, Jul 24, 2014 at 8:48 PM, Keith Wright <kw...@nanigans.com> wrote:
Hi all,

   We are seeing an issue where basically daily one of our nodes spikes in load and is churning in CMS heap pressure.  It appears that reads are backing up and my guess is that our application is reading a large row repeatedly.  Our write structure can lend itself to wide rows very infrequently (<0.001%) and we do our best to detect and delete them but obviously we’re missing a case.  Hoping for assistance on the following questions:

 *   How can I detect wide rows?
 *   Anyone know what debug level I can set so that I can see what reads the hot node is handling?  I’m hoping to see the “bad” row
 *   Any way to get the server to blacklist these wide rows automatically?

We’re using C* 2.0.6 with Vnodes.

Thanks



Re: Hot, large row

Posted by Jack Krupansky <ja...@basetechnology.com>.
Could it be some “fat columns” (cells with large blob or text values) rather than the cell-count per se? IOW, a “big row” rather than a “wide row”?

And, could it be a large partition rather than a large row (many rows in a single partition)? Are clustering columns being used in the primary key?
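
For illustration only (a hypothetical table, not the schema under discussion), the distinction looks like this in CQL:

CREATE TABLE my_keyspace.user_events (
    user_id    uuid,
    event_time timeuuid,
    payload    text,
    PRIMARY KEY (user_id, event_time)   -- user_id: partition key, event_time: clustering column
);

Every row for a given user_id lands in the same partition, so a busy key grows a large partition (high cell count) even if each individual cell stays small. Conversely, a few oversized payload values produce a "big row" with a low cell count, which the cell-count histogram would not flag.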

-- Jack Krupansky

From: DuyHai Doan 
Sent: Thursday, July 24, 2014 3:53 PM
To: user@cassandra.apache.org 
Subject: Re: Hot, large row

Your extract of cfhistograms shows that there are no particular "wide rows". The widest has 61214 cells, which is big but not so huge as to be a real concern.


Turning on trace probability only gives you some "hints" about what kind of queries are being run; it does not give the exact partition key or other statement values, especially when you are using prepared statements ...


"I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml so that it reduces the impact when this occurs." --> Don't do that, you'll only sweep dust under the carpet. Find the real issue and fix it instead of changing a parameter to hide it.


One solution would be, on the client side, to activate some logging to show the CQL3 statements the application is issuing that may overload the server.  I know that's easier said than done, but I don't have any other idea for the moment.


-------- Shameless self-promotion ------


To support this kind of live prod debugging & investigation, I added a new dynamic query logging feature in Achilles: https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging


Once you hit the issue, this kind of feature may save your day...









On Thu, Jul 24, 2014 at 9:22 PM, Keith Wright <kw...@nanigans.com> wrote:

  I can see from cfhistograms that I do have some wide rows (see below).  I set trace probability as you suggested but the output doesn’t appear to tell me what row was actually read unless I missed something.  I just see executing prepared statement.   Any ideas how I can find the row in question?

  I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml so that it reduces the impact when this occurs.

  Any help in identifying my issue would be GREATLY appreciated

  Cell Count per Partition

      1 cells: 50449950

      2 cells: 14281828

      3 cells: 8093366

      4 cells: 5029200

      5 cells: 3103023

      6 cells: 3059903

      7 cells: 1903018

      8 cells: 1509297

     10 cells: 2420359

     12 cells: 1624895

     14 cells: 1171678

     17 cells: 1289391

     20 cells: 909777

     24 cells: 852081

     29 cells: 722925

     35 cells: 587067

     42 cells: 459473

     50 cells: 358744

     60 cells: 304146

     72 cells: 244682

     86 cells: 191045

    103 cells: 155337

    124 cells: 127061

    149 cells: 98913

    179 cells: 77454

    215 cells: 59849

    258 cells: 46117

    310 cells: 35321

    372 cells: 26319

    446 cells: 19379

    535 cells: 13783

    642 cells: 9993

    770 cells: 6973

    924 cells: 4713

  1109 cells: 3229

  1331 cells: 2062

  1597 cells: 1338

  1916 cells: 773

  2299 cells: 495

  2759 cells: 268

  3311 cells: 150

  3973 cells: 100

  4768 cells: 42

  5722 cells: 24

  6866 cells: 12

  8239 cells: 9

  9887 cells: 3

  11864 cells: 0

  14237 cells: 5

  17084 cells: 1

  20501 cells: 0

  24601 cells: 2

  29521 cells: 0

  35425 cells: 0

  42510 cells: 0

  51012 cells: 0

  61214 cells: 2


  From: DuyHai Doan <do...@gmail.com>
  Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
  Date: Thursday, July 24, 2014 at 3:01 PM
  To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
  Subject: Re: Hot, large row


  "How can I detect wide rows?" -->


  nodetool cfhistograms <keyspace> <suspected column family>


  Look at column "Column count" (last column) and identify a line in this column with very high value of "Offset". In a well designed application you should have a gaussian distribution where 80% of your row have a similar number of columns.


  "Anyone know what debug level I can set so that I can see what reads the hot node is handling?  " -->


  "nodetool settraceprobability <value>",  where value is a small number (0.001) on the node where you encounter the issue. Activate the tracing for a while (5 mins) and deactivate it (value = 0). Then look into system_traces tables "events" & "sessions". It may help or not since the tracing is done once every 1000. 


  "Any way to get the server to blacklist these wide rows automatically?" --> No




  On Thu, Jul 24, 2014 at 8:48 PM, Keith Wright <kw...@nanigans.com> wrote:

    Hi all,

       We are seeing an issue where basically daily one of our nodes spikes in load and is churning in CMS heap pressure.  It appears that reads are backing up and my guess is that our application is reading a large row repeatedly.  Our write structure can lend itself to wide rows very infrequently (<0.001%) and we do our best to detect and delete them but obviously we’re missing a case.  Hoping for assistance on the following questions:
      *   How can I detect wide rows?
      *   Anyone know what debug level I can set so that I can see what reads the hot node is handling?  I’m hoping to see the “bad” row
      *   Any way to get the server to blacklist these wide rows automatically?
    We’re using C* 2.0.6 with Vnodes.

    Thanks


Re: Hot, large row

Posted by Keith Wright <kw...@nanigans.com>.
I appreciate the feedback; doing it on the client side is interesting, and I will start looking into that.  To be clear, here are the symptoms I am seeing:


 *   A node will start showing high load and the CMS collection time jumps to 100+ ms/sec (ParNew is also up)
 *   If I run nodetool tpstats, I see a high number of items in the Pending phase for ReadStage.  Other items mostly appear near empty.  In addition, I see dropped reads.
 *   If I shut down the affected node, two other nodes show increased load.  I am at RF=3, so this would again suggest slow reads against a single row

Does it sound correct that my best course of action is to investigate a large row?  If it were a small row being called repeatedly, I assume OS/key cache would make that a VERY fast operation.

Thanks

From: DuyHai Doan <do...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Thursday, July 24, 2014 at 3:53 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Hot, large row

Your extract of cfhistograms shows that there are no particular "wide rows". The widest has 61214 cells, which is big but not so huge as to be a real concern.

Turning on trace probability only gives you some "hints" about what kind of queries are being run; it does not give the exact partition key or other statement values, especially when you are using prepared statements ...


"I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml so that it reduces the impact when this occurs." --> Don't do that, you'll only sweep dust under the carpet. Find the real issue and fix it instead of changing a parameter to hide it.

One solution would be, on the client side, to activate some logging to show the CQL3 statements the application is issuing that may overload the server.  I know that's easier said than done, but I don't have any other idea for the moment.

-------- Shameless self-promotion ------

To support this kind of live prod debugging & investigation, I added a new dynamic query logging feature in Achilles: https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging

Once you hit the issue, this kind of feature may save your day...






On Thu, Jul 24, 2014 at 9:22 PM, Keith Wright <kw...@nanigans.com> wrote:
I can see from cfhistograms that I do have some wide rows (see below).  I set trace probability as you suggested but the output doesn’t appear to tell me what row was actually read unless I missed something.  I just see executing prepared statement.   Any ideas how I can find the row in question?

I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml so that it reduces the impact when this occurs.

Any help in identifying my issue would be GREATLY appreciated


Cell Count per Partition

    1 cells: 50449950

    2 cells: 14281828

    3 cells: 8093366

    4 cells: 5029200

    5 cells: 3103023

    6 cells: 3059903

    7 cells: 1903018

    8 cells: 1509297

   10 cells: 2420359

   12 cells: 1624895

   14 cells: 1171678

   17 cells: 1289391

   20 cells: 909777

   24 cells: 852081

   29 cells: 722925

   35 cells: 587067

   42 cells: 459473

   50 cells: 358744

   60 cells: 304146

   72 cells: 244682

   86 cells: 191045

  103 cells: 155337

  124 cells: 127061

  149 cells: 98913

  179 cells: 77454

  215 cells: 59849

  258 cells: 46117

  310 cells: 35321

  372 cells: 26319

  446 cells: 19379

  535 cells: 13783

  642 cells: 9993

  770 cells: 6973

  924 cells: 4713

 1109 cells: 3229

 1331 cells: 2062

 1597 cells: 1338

 1916 cells: 773

 2299 cells: 495

 2759 cells: 268

 3311 cells: 150

 3973 cells: 100

 4768 cells: 42

 5722 cells: 24

 6866 cells: 12

 8239 cells: 9

 9887 cells: 3

11864 cells: 0

14237 cells: 5

17084 cells: 1

20501 cells: 0

24601 cells: 2

29521 cells: 0

35425 cells: 0

42510 cells: 0

51012 cells: 0

61214 cells: 2

From: DuyHai Doan <do...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Thursday, July 24, 2014 at 3:01 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Hot, large row

"How can I detect wide rows?" -->

nodetool cfhistograms <keyspace> <suspected column family>

Look at column "Column count" (last column) and identify a line in this column with very high value of "Offset". In a well designed application you should have a gaussian distribution where 80% of your row have a similar number of columns.

"Anyone know what debug level I can set so that I can see what reads the hot node is handling?  " -->

"nodetool settraceprobability <value>",  where value is a small number (0.001) on the node where you encounter the issue. Activate the tracing for a while (5 mins) and deactivate it (value = 0). Then look into system_traces tables "events" & "sessions". It may help or not since the tracing is done once every 1000.

"Any way to get the server to blacklist these wide rows automatically?" --> No


On Thu, Jul 24, 2014 at 8:48 PM, Keith Wright <kw...@nanigans.com> wrote:
Hi all,

   We are seeing an issue where basically daily one of our nodes spikes in load and is churning in CMS heap pressure.  It appears that reads are backing up and my guess is that our application is reading a large row repeatedly.  Our write structure can lend itself to wide rows very infrequently (<0.001%) and we do our best to detect and delete them but obviously we’re missing a case.  Hoping for assistance on the following questions:

 *   How can I detect wide rows?
 *   Anyone know what debug level I can set so that I can see what reads the hot node is handling?  I’m hoping to see the “bad” row
 *   Any way to get the server to blacklist these wide rows automatically?

We’re using C* 2.0.6 with Vnodes.

Thanks



Re: Hot, large row

Posted by DuyHai Doan <do...@gmail.com>.
Your extract of cfhistograms shows that there are no particular "wide rows".
The widest has 61214 cells, which is big but not so huge as to be a real
concern.

Turning on trace probability only gives you some "hints" about what kind of
queries are being run; it does not give the exact partition key or other
statement values, especially when you are using prepared statements ...


"I am considering reducing read_request_timeout_in_ms: 5000 in
cassandra.yaml so that it reduces the impact when this occurs." --> Don't
do that, you'll only sweep dust under the carpet. Find the real issue and
fix it instead of changing a parameter to hide it.

One solution would be, on the client side, to activate some logging to show
the CQL3 statements the application is issuing that may overload the
server.  I know that's easier said than done, but I don't have any other
idea for the moment.

-------- Shameless self-promotion ------

To support this kind of live prod debugging & investigation, I added a
new dynamic query logging feature in Achilles:
https://github.com/doanduyhai/Achilles/wiki/Statements-Logging-and-Tracing#dynamic-statements-logging

Once you hit the issue, this kind of feature may save your day...






On Thu, Jul 24, 2014 at 9:22 PM, Keith Wright <kw...@nanigans.com> wrote:

> I can see from cfhistograms that I do have some wide rows (see below).  I
> set trace probability as you suggested but the output doesn’t appear to
> tell me what row was actually read unless I missed something.  I just see
> executing prepared statement.   Any ideas how I can find the row in
> question?
>
> I am considering reducing read_request_timeout_in_ms: 5000 in
> cassandra.yaml so that it reduces the impact when this occurs.
>
> Any help in identifying my issue would be GREATLY appreciated
>
> Cell Count per Partition
>
>     1 cells: 50449950
>
>     2 cells: 14281828
>
>     3 cells: 8093366
>
>     4 cells: 5029200
>
>     5 cells: 3103023
>
>     6 cells: 3059903
>
>     7 cells: 1903018
>
>     8 cells: 1509297
>
>    10 cells: 2420359
>
>    12 cells: 1624895
>
>    14 cells: 1171678
>
>    17 cells: 1289391
>
>    20 cells: 909777
>
>    24 cells: 852081
>
>    29 cells: 722925
>
>    35 cells: 587067
>
>    42 cells: 459473
>
>    50 cells: 358744
>
>    60 cells: 304146
>
>    72 cells: 244682
>
>    86 cells: 191045
>
>   103 cells: 155337
>
>   124 cells: 127061
>
>   149 cells: 98913
>
>   179 cells: 77454
>
>   215 cells: 59849
>
>   258 cells: 46117
>
>   310 cells: 35321
>
>   372 cells: 26319
>
>   446 cells: 19379
>
>   535 cells: 13783
>
>   642 cells: 9993
>
>   770 cells: 6973
>
>   924 cells: 4713
>
>  1109 cells: 3229
>
>  1331 cells: 2062
>
>  1597 cells: 1338
>
>  1916 cells: 773
>
>  2299 cells: 495
>
>  2759 cells: 268
>
>  3311 cells: 150
>
>  3973 cells: 100
>
>  4768 cells: 42
>
>  5722 cells: 24
>
>  6866 cells: 12
>
>  8239 cells: 9
>
>  9887 cells: 3
>
> 11864 cells: 0
>
> 14237 cells: 5
>
> 17084 cells: 1
>
> 20501 cells: 0
>
> 24601 cells: 2
>
> 29521 cells: 0
>
> 35425 cells: 0
>
> 42510 cells: 0
>
> 51012 cells: 0
>
> 61214 cells: 2
>
> From: DuyHai Doan <do...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Date: Thursday, July 24, 2014 at 3:01 PM
> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Subject: Re: Hot, large row
>
> "How can I detect wide rows?" -->
>
> nodetool cfhistograms <keyspace> <suspected column family>
>
> Look at column "Column count" (last column) and identify a line in this
> column with very high value of "Offset". In a well designed application you
> should have a gaussian distribution where 80% of your row have a similar
> number of columns.
>
> "Anyone know what debug level I can set so that I can see what reads the
> hot node is handling?  " -->
>
> "nodetool settraceprobability <value>",  where value is a small number
> (0.001) on the node where you encounter the issue. Activate the tracing for
> a while (5 mins) and deactivate it (value = 0). Then look into
> system_traces tables "events" & "sessions". It may help or not since the
> tracing is done once every 1000.
>
> "Any way to get the server to blacklist these wide rows automatically?"
> --> No
>
>
> On Thu, Jul 24, 2014 at 8:48 PM, Keith Wright <kw...@nanigans.com>
> wrote:
>
>> Hi all,
>>
>>    We are seeing an issue where basically daily one of our nodes spikes
>> in load and is churning in CMS heap pressure.  It appears that reads are
>> backing up and my guess is that our application is reading a large row
>> repeatedly.  Our write structure can lend itself to wide rows very
>> infrequently (<0.001%) and we do our best to detect and delete them but
>> obviously we’re missing a case.  Hoping for assistance on the following
>> questions:
>>
>>    - How can I detect wide rows?
>>    - Anyone know what debug level I can set so that I can see what reads
>>    the hot node is handling?  I’m hoping to see the “bad” row
>>    - Any way to get the server to blacklist these wide rows
>>    automatically?
>>
>> We’re using C* 2.0.6 with Vnodes.
>>
>> Thanks
>>
>
>

Re: Hot, large row

Posted by Keith Wright <kw...@nanigans.com>.
I can see from cfhistograms that I do have some wide rows (see below).  I set trace probability as you suggested but the output doesn’t appear to tell me what row was actually read unless I missed something.  I just see executing prepared statement.   Any ideas how I can find the row in question?

I am considering reducing read_request_timeout_in_ms: 5000 in cassandra.yaml so that it reduces the impact when this occurs.

Any help in identifying my issue would be GREATLY appreciated
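
For anyone digging into the traces: the per-step detail lives in system_traces.events rather than sessions. A hedged sketch with a made-up session id (column names as shipped with 2.0.x, so verify against your cluster):

SELECT activity, source, source_elapsed
FROM system_traces.events
WHERE session_id = 11111111-2222-3333-4444-555555555555;   -- hypothetical id taken from system_traces.sessions

As noted elsewhere in the thread, a trace of a prepared statement does not expose the bound partition key, but the events do show which node did the work and the per-step timings, which at least separates a genuinely slow read from coordinator overload.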


Cell Count per Partition

    1 cells: 50449950

    2 cells: 14281828

    3 cells: 8093366

    4 cells: 5029200

    5 cells: 3103023

    6 cells: 3059903

    7 cells: 1903018

    8 cells: 1509297

   10 cells: 2420359

   12 cells: 1624895

   14 cells: 1171678

   17 cells: 1289391

   20 cells: 909777

   24 cells: 852081

   29 cells: 722925

   35 cells: 587067

   42 cells: 459473

   50 cells: 358744

   60 cells: 304146

   72 cells: 244682

   86 cells: 191045

  103 cells: 155337

  124 cells: 127061

  149 cells: 98913

  179 cells: 77454

  215 cells: 59849

  258 cells: 46117

  310 cells: 35321

  372 cells: 26319

  446 cells: 19379

  535 cells: 13783

  642 cells: 9993

  770 cells: 6973

  924 cells: 4713

 1109 cells: 3229

 1331 cells: 2062

 1597 cells: 1338

 1916 cells: 773

 2299 cells: 495

 2759 cells: 268

 3311 cells: 150

 3973 cells: 100

 4768 cells: 42

 5722 cells: 24

 6866 cells: 12

 8239 cells: 9

 9887 cells: 3

11864 cells: 0

14237 cells: 5

17084 cells: 1

20501 cells: 0

24601 cells: 2

29521 cells: 0

35425 cells: 0

42510 cells: 0

51012 cells: 0

61214 cells: 2

From: DuyHai Doan <do...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Thursday, July 24, 2014 at 3:01 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Hot, large row

"How can I detect wide rows?" -->

nodetool cfhistograms <keyspace> <suspected column family>

Look at column "Column count" (last column) and identify a line in this column with very high value of "Offset". In a well designed application you should have a gaussian distribution where 80% of your row have a similar number of columns.

"Anyone know what debug level I can set so that I can see what reads the hot node is handling?  " -->

"nodetool settraceprobability <value>",  where value is a small number (0.001) on the node where you encounter the issue. Activate the tracing for a while (5 mins) and deactivate it (value = 0). Then look into system_traces tables "events" & "sessions". It may help or not since the tracing is done once every 1000.

"Any way to get the server to blacklist these wide rows automatically?" --> No


On Thu, Jul 24, 2014 at 8:48 PM, Keith Wright <kw...@nanigans.com> wrote:
Hi all,

   We are seeing an issue where basically daily one of our nodes spikes in load and is churning in CMS heap pressure.  It appears that reads are backing up and my guess is that our application is reading a large row repeatedly.  Our write structure can lend itself to wide rows very infrequently (<0.001%) and we do our best to detect and delete them but obviously we’re missing a case.  Hoping for assistance on the following questions:

 *   How can I detect wide rows?
 *   Anyone know what debug level I can set so that I can see what reads the hot node is handling?  I’m hoping to see the “bad” row
 *   Any way to get the server to blacklist these wide rows automatically?

We’re using C* 2.0.6 with Vnodes.

Thanks


Re: Hot, large row

Posted by DuyHai Doan <do...@gmail.com>.
"How can I detect wide rows?" -->

nodetool cfhistograms <keyspace> <suspected column family>

Look at column "Column count" (last column) and identify a line in this
column with very high value of "Offset". In a well designed application you
should have a gaussian distribution where 80% of your row have a similar
number of columns.

"Anyone know what debug level I can set so that I can see what reads the
hot node is handling?  " -->

"nodetool settraceprobability <value>",  where value is a small number
(0.001) on the node where you encounter the issue. Activate the tracing for
a while (5 mins) and deactivate it (value = 0). Then look into
system_traces tables "events" & "sessions". It may help or not since the
tracing is done once every 1000.

"Any way to get the server to blacklist these wide rows automatically?" -->
No
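
A minimal sketch of the follow-up query, assuming the default 2.0.x
system_traces schema (verify the column names on your version):

SELECT session_id, coordinator, duration, request, started_at
FROM system_traces.sessions
LIMIT 100;

The duration column is in microseconds; scan the output for outliers, then
pull the matching rows from system_traces.events to see where the time went.
As noted elsewhere in the thread, traces of prepared statements will not
reveal the bound partition key.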


On Thu, Jul 24, 2014 at 8:48 PM, Keith Wright <kw...@nanigans.com> wrote:

> Hi all,
>
>    We are seeing an issue where basically daily one of our nodes spikes in
> load and is churning in CMS heap pressure.  It appears that reads are
> backing up and my guess is that our application is reading a large row
> repeatedly.  Our write structure can lend itself to wide rows very
> infrequently (<0.001%) and we do our best to detect and delete them but
> obviously we’re missing a case.  Hoping for assistance on the following
> questions:
>
>    - How can I detect wide rows?
>    - Anyone know what debug level I can set so that I can see what reads
>    the hot node is handling?  I’m hoping to see the “bad” row
>    - Any way to get the server to blacklist these wide rows automatically?
>
> We’re using C* 2.0.6 with Vnodes.
>
> Thanks
>