Posted to user@cassandra.apache.org by Joel Samuelsson <sa...@gmail.com> on 2013/04/03 11:55:32 UTC

Re: Cassandra freezes

It seems this problem is back and I am unsure how to solve it. I have a
test setup like this:
4 machines run 8 processes each. Each process has 2 threads, one writing
100 000 rows and one reading another 100 000 rows. Each machine (and
process) reads and writes the exact same rows, so it is essentially the same
200 000 rows being read / written. A rough sketch of one such process is
included below.
The Cassandra cluster is a single-node cluster.
The first 10-20 runs of the test described above go smoothly; after that,
tests take an increasingly long time, with GC happening almost all the time.
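
For illustration only (this is not the actual test code), each process looks
roughly like this; pycassa, the keyspace name 'ks' and the host below are just
placeholders for whatever client you use:

import threading
import pycassa

pool = pycassa.ConnectionPool('ks', ['cassandra-host:9160'])
cf = pycassa.ColumnFamily(pool, 'rawData')  # pycassa defaults to consistency level ONE

PAYLOAD = 'x' * 1024   # ~1 KB per row
N = 100000

def writer():
    # each machine/process writes the same 100 000 row keys
    for i in range(N):
        cf.insert('write-row-%d' % i, {'data': PAYLOAD})

def reader():
    # the rows being read were loaded into the cluster before the test
    for i in range(N):
        cf.get('read-row-%d' % i, columns=['data'])

threads = [threading.Thread(target=writer), threading.Thread(target=reader)]
for t in threads:
    t.start()
for t in threads:
    t.join()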

Here is my CASS-FREEZE-001 form answers:

How big is your JVM heap ?
2GB

How many CPUs ?
It's a virtual environment so I can't be perfectly sure, but according to the
provider's specification, "8 cores".

Garbage collection taking long ? ( look for log lines from GCInspector)
Yes, these are a few lines seen during 1 test run:
INFO [ScheduledTasks:1] 2013-04-03 08:47:40,757 GCInspector.java (line 122)
GC for ParNew: 40370 ms for 3 collections, 565045688 used; max is 2038693888
INFO [ScheduledTasks:1] 2013-04-03 08:48:24,720 GCInspector.java (line 122)
GC for ParNew: 39840 ms for 2 collections, 614065528 used; max is 2038693888
INFO [ScheduledTasks:1] 2013-04-03 08:49:09,319 GCInspector.java (line 122)
GC for ParNew: 37666 ms for 2 collections, 682352952 used; max is 2038693888
INFO [ScheduledTasks:1] 2013-04-03 08:50:02,577 GCInspector.java (line 122)
GC for ParNew: 44590 ms for 1 collections, 792861352 used; max is 2038693888
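
(For anyone who wants more detail than the GCInspector summaries: as far as I
know the stock conf/cassandra-env.sh ships with commented-out GC logging
options along these lines that can be enabled; the log path here is just an
example.)

JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"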


Running out of heap ? ( "heap is .. full" log lines )
Yes. Same run as above:
WARN [ScheduledTasks:1] 2013-04-03 08:54:35,108 GCInspector.java (line 139)
Heap is 0.8596674853032178 full.  You may need to reduce memtable and/or
cache sizes.  Cassandra is now reducing cache sizes to free up memory.
 Adjust reduce_cache_sizes_at threshold in cassandra.yaml if you don't want
Cassandra to do this automatically
WARN [ScheduledTasks:1] 2013-04-03 08:54:36,831 GCInspector.java (line 145)
Heap is 0.8596674853032178 full.  You may need to reduce memtable and/or
cache sizes.  Cassandra will now flush up to the two largest memtables to
free up memory.  Adjust flush_largest_memtables_at threshold in
cassandra.yaml if you don't want Cassandra to do this automatically


Any tasks backing up / being dropped ? ( nodetool tpstats and ".. dropped
in last .. ms" log lines )
Yes. Same run as above:
INFO [ScheduledTasks:1] 2013-04-03 08:52:04,943 MessagingService.java (line
673) 31 MUTATION messages dropped in last 5000ms
INFO [ScheduledTasks:1] 2013-04-03 08:52:04,944 MessagingService.java (line
673) 8 READ messages dropped in last 5000ms

Are writes really slow? ( nodetool cfhistograms Keyspace ColumnFamily )
Not sure how to interpret the output of nodetool cfhistograms, but here it
is (I hope it's fairly readable):
Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
1           38162520                 0                 0                 0            200000
2                  0                22                 0                 0                 0
3                  0              1629                 0                 0                 0
4                  0              9990                 0                 0                 0
5                  0             40169                 0                 0                 0
6                  0            161538                 0                 0                 0
7                  0            487266                 0                 0                 0
8                  0           1096601                 0                 0                 0
10                 0           4842978                 0                 0                 0
12                 0           7976003                 0                 0                 0
14                 0           8673230                 0                 0                 0
17                 0           9805730                 0                 0                 0
20                 0           5083707                 0                 0                 0
24                 0           2541157                 0                 0                 0
29                 0            768916                 0                 0                 0
35                 0            220440                 0                 0                 0
42                 0            112915                 0                 0                 0
50                 0             71469                 0                 0                 0
60                 0             48909                 0                 0                 0
72                 0             50714                 0                 0                 0
86                 0             45390                 0                 0                 0
103                0             41975                 0                 0                 0
124                0             40371                 0                 0                 0
149                0             37103                 0                 0                 0
179                0             44631                 0                 0                 0
215                0             43957                 0                 0                 0
258                0             32499                 1                 0                 0
310                0             18446          23056779                 0                 0
372                0             13113          12580639                 0                 0
446                0              9862           1017347                 0                 0
535                0              7480            784506                 0                 0
642                0              5473            274274                 0                 0
770                0              4084             56379                 0                 0
924                0              3046             27979                 0                 0
1109               0              2205             20206            200000                 0
1331               0              1658             16947                 0                 0
1597               0              1228             16969                 0                 0
1916               0               896             15848                 0                 0
2299               0               542             13928                 0                 0
2759               0               379             11782                 0                 0
3311               0               326              9761                 0                 0
3973               0               540              8997                 0                 0
4768               0               450              7938                 0                 0
5722               0               270              6552                 0                 0
6866               0               170              6022                 0                 0
8239               0               146              6474                 0                 0
9887               0               166              7969                 0                 0
11864              0               176             53725                 0                 0
14237              0               203             10260                 0                 0
17084              0               255              6827                 0                 0
20501              0               312             27462                 0                 0
24601              0               445             11523                 0                 0
29521              0               736              9904                 0                 0
35425              0               909             20539                 0                 0
42510              0               896             14280                 0                 0
51012              0               904             12443                 0                 0
61214              0               715             11956                 0                 0
73457              0               652             10040                 0                 0
88148              0               474              7992                 0                 0
105778             0               256              5043                 0                 0
126934             0               113              2370                 0                 0
152321             0                75              1189                 0                 0
182785             0                39               690                 0                 0
219342             0                44               550                 0                 0
263210             0                69               551                 0                 0
315852             0                35               419                 0                 0
379022             0                35               564                 0                 0
454826             0                52               504                 0                 0
545791             0                79               749                 0                 0
654949             0                61               737                 0                 0
785939             0                30               399                 0                 0
943127             0                57               611                 0                 0
1131752            0                78               488                 0                 0
1358102            0                23               302                 0                 0
1629722            0                28               240                 0                 0
1955666            0                48               294                 0                 0
2346799            0                28               306                 0                 0
2816159            0                19               224                 0                 0
3379391            0                37               212                 0                 0
4055269            0                24               237                 0                 0
4866323            0                13               137                 0                 0
5839588            0                11                99                 0                 0
7007506            0                 4               115                 0                 0
8409007            0                16               194                 0                 0
10090808           0                12               156                 0                 0
12108970           0                12                54                 0                 0
14530764           0                24               147                 0                 0
17436917           0                10               114                 0                 0
20924300           0                 3                66                 0                 0
25109160           0                22               100                 0                 0+

Some of the write latencies look really bad, but since the Column Count is 0
for most of those rows, I am not sure what to make of it.



How much is lots of data?
Lots of data might have been an exaggeration, but the test is as described
above. Each row read or written is about 1 KB in size, so each test run
generates 4 (machines) * 8 (processes) * 2 (read and write) * 100 000
(rows) * 1 KB (row size) = 6250 MB read or written (half of each).

Wide or skinny rows?
Skinny rows, only a single column is used for each row.

Mutations/sec ?
The test, when run against a freshly restarted Cassandra, takes around 390
seconds, and 6 400 000 rows are read / written during this time, so around
16410 mutations / second (unless I misunderstood the word mutation).

Which Compaction Strategy are you using?
SizeTieredCompactionStrategy

Output of show schema (
cassandra-cli ) for the relevant Keyspace/CF might help as well
create column family rawData
  with column_type = 'Standard'
  and comparator = 'UTF8Type'
  and default_validation_class = 'BytesType'
  and key_validation_class = 'BytesType'
  and read_repair_chance = 0.1
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 864000
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy =
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
  and caching = 'KEYS_ONLY'
  and column_metadata = [
    {column_name : 'created',
    validation_class : LongType},
    {column_name : 'socketHash',
    validation_class : Int32Type},
    {column_name : 'data',
    validation_class : UTF8Type},
    {column_name : 'guid',
    validation_class : UTF8Type},
    {column_name : 'evaluated',
    validation_class : Int32Type,
    index_name : 'rawData_evaluated_idx_1',
    index_type : 0}]
  and compression_options = {'sstable_compression' :
'org.apache.cassandra.io.compress.SnappyCompressor'};

Only the "data" column is used during the test.



What consistency are you doing your writes with ?
I am writing with consistency level ONE.


What are the values for these settings in cassandra.yaml

memtable_total_space_in_mb: No value set in cassandra.yaml, so 1/3 of the heap
according to the documentation (2 GB / 3, roughly 682 MB)
memtable_flush_writers: No value set in cassandra.yaml, but only one data
directory so I assume it is 1.
memtable_flush_queue_size: 4
compaction_throughput_mb_per_sec: 16
concurrent_writes: 32
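
Written out explicitly, that corner of cassandra.yaml would look something
like this (the memtable figure is just my reading of the 1/3-of-heap default
for a 2 GB heap, not a value I have actually set):

memtable_total_space_in_mb: 682
memtable_flush_writers: 1
memtable_flush_queue_size: 4
compaction_throughput_mb_per_sec: 16
concurrent_writes: 32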


Which version of Cassandra?
1.1.8


Hope this helps you help me :)
Best regards,
Joel Samuelsson


2013/3/22 Joel Samuelsson <sa...@gmail.com>

> Thanks for the GC suggestion. It seems we didn't have enough CPU power to
> handle both the data and GC. Increasing the number of CPU cores made
> everything run smoothly at the same load.
>
>
> 2013/3/21 Andras Szerdahelyi <an...@ignitionone.com>
>
>>  Neat!
>> Thanks.
>>
>>   From: Sylvain Lebresne <sy...@datastax.com>
>> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>> Date: Thursday 21 March 2013 10:10
>> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>> Subject: Re: Cassandra freezes
>>
>>   Prior to 1.2 the index summaries were not saved on disk, and were thus
>> computed on startup while the sstable was loaded. In 1.2 they now are saved
>> on disk to make startup faster (
>> https://issues.apache.org/jira/browse/CASSANDRA-2392). That being said,
>> if the index_interval value used by a summary saved doesn't match the
>> current one while the sstable is loaded, the summary is recomputed anyway,
>> so restarting a node should always take a new index_interval setting into
>> account.
>>
>>  --
>> Sylvain
>>
>>
>> On Thu, Mar 21, 2013 at 9:43 AM, Andras Szerdahelyi <
>> andras.szerdahelyi@ignitionone.com> wrote:
>>
>>> I can not find the reference that notes having to upgradesstables when
>>> you
>>> change this. I really hope such complex assumptions are not formulating
>>> in
>>> my head just on their own and there actually exists some kind of reliable
>>> reference that clears this up :-) but,
>>>
>>> # index_interval controls the sampling of entries from the primrary
>>> # row index in terms of space versus time. The larger the interval,
>>> # the smaller and less effective the sampling will be. In technicial
>>> # terms, the interval coresponds to the number of index entries that
>>> # are skipped between taking each sample. All the sampled entries
>>> # must fit in memory. Generally, a value between 128 and 512 here
>>> # coupled with a large key cache size on CFs results in the best trade
>>> # offs. This value is not often changed, however if you have many
>>> # very small rows (many to an OS page), then increasing this will
>>> # often lower memory usage without a impact on performance.
>>>
>>> it is ( very ) safe to assume the row index is re-built/updated when new
>>> sstables are built.
>>> Obviously the sample of this index will have to follow this process very
>>> closely.
>>>
>>> It is possible however that the sample itself is not persisted and is
>>> built at startup as opposed to *only* when the index changes.( which is
>>> what I thought was happening )
>>> It shouldn't be too difficult to verify this, but I'd appreciate if
>>> someone who looked at this before could confirm if this is the case.
>>>
>>> Thanks,
>>> Andras
>>>
>>> On 21/03/13 09:13, "Michal Michalski" <mi...@opera.com> wrote:
>>>
>>> >About index_interval:
>>> >
>>> >> 1) you have to rebuild stables ( not an issue if you are evaluating,
>>> >>doing
>>> >> test writes.. Etc, not so much in production )
>>> >
>>> >Are you sure of this? As I understand indexes, it's not required because
>>> >this parameter defines an interval of in-memory index sample, which is
>>> >created during C* startup basing on a primary on-disk index file. The
>>> >fact that Heap usage is reduced immediately after C* restart seem to
>>> >confirm this, but maybe I miss something?
>>> >
>>> >M.
>>>
>>>
>>
>

Re: Cassandra freezes

Posted by "Hiller, Dean" <De...@nrel.gov>.
I am going to throw some info out there for you as it might help.

 1.  RAM usage grows with dataset size on that node (adding more nodes reduces the RAM used per node since each node holds fewer rows).  index_interval can be upped to reduce RAM usage but be careful with it.  Switching to LCS and taking the bloom filter false-positive chance to 0.1 can lower RAM usage, but again: test, test, test first (a cassandra-cli sketch follows below).
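
If you want to try that, something along these lines from cassandra-cli should do it (double-check the attribute names against your version; index_interval itself lives in cassandra.yaml, not on the column family):

update column family rawData
  with compaction_strategy = 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'
  and bloom_filter_fp_chance = 0.1;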

Aaron missed one other possibility (only because usually you increase to 8G RAM before doing this one): add more nodes.  Cassandra performance stays extremely consistent up until you hit that per-node limit.  It seems to me you are researching what you can do per node, which is a good thing, and we had to go through that.  I think at some point, most teams do.  To accurately run the test though, you should be at 8G RAM, as there are huge swings in RAM for compactions (maybe less so with LCS).

Recovering once you hit a RAM issue is not nice.  We had this in production once and had to temporarily increase to 12G and work out fixes until we got the new nodes.

Not sure if this helps at all, but it's good to know.

Later,
Dean

From: Joel Samuelsson <sa...@gmail.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Thursday, April 4, 2013 5:49 AM
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Cassandra freezes

Yes, both of those solutions seem fine.

My question still seems valid though; shouldn't the node recover and perform as well as it did during the first few tests? If not, what makes the node not have the same issues at a smaller load but after a longer period of time? Having nodes' performance drop radically over time seems unacceptable and not something most people experience.


2013/4/4 aaron morton <aa...@thelastpickle.com>>
INFO [ScheduledTasks:1] 2013-04-03 08:47:40,757 GCInspector.java (line 122) GC for ParNew: 40370 ms for 3 collections, 565045688 used; max is 2038693888
This is the JVM pausing the application for 40 seconds to complete GC.

You have two choices: use a bigger heap (4GB to 8GB) or run a lower workload. A sketch of the relevant cassandra-env.sh settings is below.
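
(The heap is normally set in conf/cassandra-env.sh; the sizes below are only
an example to illustrate, tune them for your hardware.)

MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="400M"    # rough rule of thumb: ~100 MB per physical core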

 cheers


-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 3/04/2013, at 7:54 PM, Joel Samuelsson <sa...@gmail.com>> wrote:

Thanks for your suggestions. I'll get back to you with the tests you suggested, probably tomorrow. In the meantime though I have a few questions. You say:
- 2GB of JVM heap to be insufficient to run this workload against Cassandra

I realise that the one node cluster has a maximum workload. It did however work fine for a few tests and then performance deteriorated. Currently I can't even complete a test run since the server won't respond in time - even though I haven't run a test since yesterday. Shouldn't the server "recover" sooner or later and perform as well as it did during the first few tests? If not automatically, what can I do to help it? I tried nodetool flush but saw no performance improvement.

And just FYI in case it changes anything, I don't immediately read back the written rows. There are 100 000 rows being written and 100 000 rows being read in parallel. The rows being read were written to the cluster before the tests were run.


2013/4/3 Andras Szerdahelyi <an...@ignitionone.com>>
Wrt the cfhistograms output: you are supposed to treat "Offset" together with the values of each column as a separate histogram. Also, AFAIK these counters are reset after each invocation, so you are always looking at data from between two invocations of cfhistograms. With that in mind, to me your cfhistograms say:
- you encountered 200k rows with a single column

- most of your write latencies are agreeable, but that's a long tail you have there, into the hundreds of milliseconds, which can not be OK (I can not comment on how much a commit log write - an append - would affect this, as I have durable_writes:false on all my data).
Depending on how often your memtables are switched (emptied and queued for flushing to disk) and how valuable the updates received between two switches are, you may want to disable durable writes on the KS with "durable_writes=false", or at the very least place the commit log folder on its own drive. Again, I'm not absolutely sure this could affect write latency.

- 38162520 reads served with a single sstable read, that's great

- a big chunk of these reads are served from page cache or memtables (the latter being more likely since, as I understand it, you immediately read back the written column and you work with unique row keys), but again you have a long drop off

16410 mutations / sec with a 1k payload: let's say that is 20MB/s into memory with overhead; a third of the 2GB heap for memtables = 666MB, so a memtable switch every ~30 seconds.
I'm not sure your write performance can be attributed to GC only, so can you sit through one of these tests with:

watch -n2 iostat -xmd devicename ( look for avgqu-sz and what your peak write throughput is )
watch -n2 nodetool tpstats ( see if you have flushes blocked )

In the end I think you'll either find
- 2GB of JVM heap to be insufficient to run this workload against Cassandra
- or your single disk serving your data directory unable to keep up with flushing a sustained 20MB/s of writes every 30 seconds ( unlikely unless you are on EC2 EBS drives )

Also, just to be sure: restart cassandra before the test and confirm your benchmark application is doing what you think it's doing in terms of reads/writes with nodetool cfstats; something like the commands below.
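
Concretely (service name and device name are whatever applies on your box):

sudo service cassandra restart
nodetool cfstats | grep -A 20 'Column Family: rawData'
watch -n2 nodetool tpstats
watch -n2 iostat -xmd sda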

Regards,
Andras

From: Joel Samuelsson <sa...@gmail.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Wednesday 3 April 2013 11:55

To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Cassandra freezes

It seems this problem is back and I am unsure how to solve it. I have a test setup like this:
4 machines run 8 processess each. Each process has 2 threads, 1 for writing 100 000 rows and one for reading another 100 000 rows. Each machine (and process) read and write the exact same rows so it is essentially the same 200 000 rows being read / written.
The cassandra cluster is a one node cluster.
The first 10-20 runs of the test described above goes smoothly, after that tests take increasingly long time with GC happening almost all the time.

Here is my CASS-FREEZE-001 form answers:

How big is your JVM heap ?
2GB

How many CPUs ?
A virtual environment so I can't be perfectly sure but according to their specification, "8 cores".

Garbage collection taking long ? ( look for log lines from GCInspector)
Yes, these are a few lines seen during 1 test run:
INFO [ScheduledTasks:1] 2013-04-03 08:47:40,757 GCInspector.java (line 122) GC for ParNew: 40370 ms for 3 collections, 565045688 used; max is 2038693888
INFO [ScheduledTasks:1] 2013-04-03 08:48:24,720 GCInspector.java (line 122) GC for ParNew: 39840 ms for 2 collections, 614065528 used; max is 2038693888
INFO [ScheduledTasks:1] 2013-04-03 08:49:09,319 GCInspector.java (line 122) GC for ParNew: 37666 ms for 2 collections, 682352952 used; max is 2038693888
INFO [ScheduledTasks:1] 2013-04-03 08:50:02,577 GCInspector.java (line 122) GC for ParNew: 44590 ms for 1 collections, 792861352 used; max is 2038693888


Running out of heap ? ( "heap is .. full" log lines )
Yes. Same run as above:
WARN [ScheduledTasks:1] 2013-04-03 08:54:35,108 GCInspector.java (line 139) Heap is 0.8596674853032178 full.  You may need to reduce memtable and/or cache sizes.  Cassandra is now reducing cache sizes to free up memory.  Adjust reduce_cache_sizes_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN [ScheduledTasks:1] 2013-04-03 08:54:36,831 GCInspector.java (line 145) Heap is 0.8596674853032178 full.  You may need to reduce memtable and/or cache sizes.  Cassandra will now flush up to the two largest memtables to free up memory.  Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically


Any tasks backing up / being dropped ? ( nodetool tpstats and ".. dropped
in last .. ms" log lines )
Yes. Same run as above:
INFO [ScheduledTasks:1] 2013-04-03 08:52:04,943 MessagingService.java (line 673) 31 MUTATION messages dropped in last 5000ms
INFO [ScheduledTasks:1] 2013-04-03 08:52:04,944 MessagingService.java (line 673) 8 READ messages dropped in last 5000ms

Are writes really slow? ( nodetool cfhistograms Keyspace ColumnFamily )
Not sure how to interpret the output of nodetool cfhistograms, but here it is (I hope it's fairly readable):
Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
1           38162520                 0                 0                 0            200000
2                  0                22                 0                 0                 0
3                  0              1629                 0                 0                 0
4                  0              9990                 0                 0                 0
5                  0             40169                 0                 0                 0
6                  0            161538                 0                 0                 0
7                  0            487266                 0                 0                 0
8                  0           1096601                 0                 0                 0
10                 0           4842978                 0                 0                 0
12                 0           7976003                 0                 0                 0
14                 0           8673230                 0                 0                 0
17                 0           9805730                 0                 0                 0
20                 0           5083707                 0                 0                 0
24                 0           2541157                 0                 0                 0
29                 0            768916                 0                 0                 0
35                 0            220440                 0                 0                 0
42                 0            112915                 0                 0                 0
50                 0             71469                 0                 0                 0
60                 0             48909                 0                 0                 0
72                 0             50714                 0                 0                 0
86                 0             45390                 0                 0                 0
103                0             41975                 0                 0                 0
124                0             40371                 0                 0                 0
149                0             37103                 0                 0                 0
179                0             44631                 0                 0                 0
215                0             43957                 0                 0                 0
258                0             32499                 1                 0                 0
310                0             18446          23056779                 0                 0
372                0             13113          12580639                 0                 0
446                0              9862           1017347                 0                 0
535                0              7480            784506                 0                 0
642                0              5473            274274                 0                 0
770                0              4084             56379                 0                 0
924                0              3046             27979                 0                 0
1109               0              2205             20206            200000                 0
1331               0              1658             16947                 0                 0
1597               0              1228             16969                 0                 0
1916               0               896             15848                 0                 0
2299               0               542             13928                 0                 0
2759               0               379             11782                 0                 0
3311               0               326              9761                 0                 0
3973               0               540              8997                 0                 0
4768               0               450              7938                 0                 0
5722               0               270              6552                 0                 0
6866               0               170              6022                 0                 0
8239               0               146              6474                 0                 0
9887               0               166              7969                 0                 0
11864              0               176             53725                 0                 0
14237              0               203             10260                 0                 0
17084              0               255              6827                 0                 0
20501              0               312             27462                 0                 0
24601              0               445             11523                 0                 0
29521              0               736              9904                 0                 0
35425              0               909             20539                 0                 0
42510              0               896             14280                 0                 0
51012              0               904             12443                 0                 0
61214              0               715             11956                 0                 0
73457              0               652             10040                 0                 0
88148              0               474              7992                 0                 0
105778             0               256              5043                 0                 0
126934             0               113              2370                 0                 0
152321             0                75              1189                 0                 0
182785             0                39               690                 0                 0
219342             0                44               550                 0                 0
263210             0                69               551                 0                 0
315852             0                35               419                 0                 0
379022             0                35               564                 0                 0
454826             0                52               504                 0                 0
545791             0                79               749                 0                 0
654949             0                61               737                 0                 0
785939             0                30               399                 0                 0
943127             0                57               611                 0                 0
1131752            0                78               488                 0                 0
1358102            0                23               302                 0                 0
1629722            0                28               240                 0                 0
1955666            0                48               294                 0                 0
2346799            0                28               306                 0                 0
2816159            0                19               224                 0                 0
3379391            0                37               212                 0                 0
4055269            0                24               237                 0                 0
4866323            0                13               137                 0                 0
5839588            0                11                99                 0                 0
7007506            0                 4               115                 0                 0
8409007            0                16               194                 0                 0
10090808           0                12               156                 0                 0
12108970           0                12                54                 0                 0
14530764           0                24               147                 0                 0
17436917           0                10               114                 0                 0
20924300           0                 3                66                 0                 0
25109160           0                22               100                 0                 0+

Some of the write latencies looks really bad, but since they have column count 0 for most, I am not sure what to make of it.



How much is lots of data?
Lots of data might have been an exaggeration but the test is as described above. Each row read or written is about 1kb in size so each test run generates 4 (machines) * 8 (processes) * 2 (read and write) * 100 000 (rows) * 1kb (row size) = 6250 mb read or written (half of each)

Wide or skinny rows?
Skinny rows, only a single column is used for each row.

Mutations/sec ?
The test when run on a freshly rebooted cassandra takes around 390 seconds, and 6400000 rows are read / written during this time period so around 16410 mutations / second (unless I missunderstood the word mutation).

Which Compaction Strategy are you using?
SizeTieredCompactionStrategy

Output of show schema (
cassandra-cli ) for the relevant Keyspace/CF might help as well
create column family rawData
  with column_type = 'Standard'
  and comparator = 'UTF8Type'
  and default_validation_class = 'BytesType'
  and key_validation_class = 'BytesType'
  and read_repair_chance = 0.1
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 864000
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
  and caching = 'KEYS_ONLY'
  and column_metadata = [
    {column_name : 'created',
    validation_class : LongType},
    {column_name : 'socketHash',
    validation_class : Int32Type},
    {column_name : 'data',
    validation_class : UTF8Type},
    {column_name : 'guid',
    validation_class : UTF8Type},
    {column_name : 'evaluated',
    validation_class : Int32Type,
    index_name : 'rawData_evaluated_idx_1',
    index_type : 0}]
  and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};

Only the "data" column is used during the test.



What consistency are you doing your writes with ?
I am writing with consistency level ONE.


What are the values for these settings in cassandra.yaml

memtable_total_space_in_mb: No value set in cassandra.yaml, so 1/3 of heap according to documentation (2gb / 3)
memtable_flush_writers: No value set in cassandra.yaml, but only one data directory so I assume it is 1.
memtable_flush_queue_size: 4
compaction_throughput_mb_per_sec: 16
concurrent_writes: 32


Which version of Cassandra?
1.1.8


Hope this helps you help me :)
Best regards,
Joel Samuelsson


2013/3/22 Joel Samuelsson <sa...@gmail.com>>
Thanks for the GC suggestion. It seems we didn't have enough CPU power to handle both the data and GC. Increasing the number of CPU cores made everything run smoothly at the same load.


2013/3/21 Andras Szerdahelyi <an...@ignitionone.com>>
Neat!
Thanks.

From: Sylvain Lebresne <sy...@datastax.com>>
Reply-To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Date: Thursday 21 March 2013 10:10
To: "user@cassandra.apache.org<ma...@cassandra.apache.org>" <us...@cassandra.apache.org>>
Subject: Re: Cassandra freezes

Prior to 1.2 the index summaries were not saved on disk, and were thus computed on startup while the sstable was loaded. In 1.2 they now are saved on disk to make startup faster (https://issues.apache.org/jira/browse/CASSANDRA-2392). That being said, if the index_interval value used by a summary saved doesn't match the current one while the sstable is loaded, the summary is recomputed anyway, so restarting a node should always take a new index_interval setting into account.

--
Sylvain


On Thu, Mar 21, 2013 at 9:43 AM, Andras Szerdahelyi <an...@ignitionone.com>> wrote:
I can not find the reference that notes having to upgradesstables when you
change this. I really hope such complex assumptions are not formulating in
my head just on their own and there actually exists some kind of reliable
reference that clears this up :-) but,

# index_interval controls the sampling of entries from the primrary
# row index in terms of space versus time. The larger the interval,
# the smaller and less effective the sampling will be. In technicial
# terms, the interval coresponds to the number of index entries that
# are skipped between taking each sample. All the sampled entries
# must fit in memory. Generally, a value between 128 and 512 here
# coupled with a large key cache size on CFs results in the best trade
# offs. This value is not often changed, however if you have many
# very small rows (many to an OS page), then increasing this will
# often lower memory usage without a impact on performance.

it is ( very ) safe to assume the row index is re-built/updated when new
sstables are built.
Obviously the sample of this index will have to follow this process very
closely.

It is possible however that the sample itself is not persisted and is
built at startup as opposed to *only* when the index changes.( which is
what I thought was happening )
It shouldn't be too difficult to verify this, but I'd appreciate if
someone who looked at this before could confirm if this is the case.

Thanks,
Andras

On 21/03/13 09:13, "Michal Michalski" <mi...@opera.com>> wrote:

>About index_interval:
>
>> 1) you have to rebuild stables ( not an issue if you are evaluating,
>>doing
>> test writes.. Etc, not so much in production )
>
>Are you sure of this? As I understand indexes, it's not required because
>this parameter defines an interval of in-memory index sample, which is
>created during C* startup basing on a primary on-disk index file. The
>fact that Heap usage is reduced immediately after C* restart seem to
>confirm this, but maybe I miss something?
>
>M.








Re: Cassandra freezes

Posted by Joel Samuelsson <sa...@gmail.com>.
Yes, both of those solutions seem fine.

My question still seems valid though; shouldn't the node recover and
perform as well as it did during the first few tests? If not, what makes
the node not have the same issues at a smaller load but after a longer
period of time? Having nodes' performance drop radically over time seems
unacceptable and not something most people experience.


2013/4/4 aaron morton <aa...@thelastpickle.com>

> INFO [ScheduledTasks:1] 2013-04-03 08:47:40,757 GCInspector.java (line
> 122) GC for ParNew: 40370 ms for 3 collections, 565045688 used; max is
> 2038693888
>
> This is the JVM pausing the application for 40 seconds to complete GC.
>
> You have two choices, use a bigger heap (4Gb to 8GB) or have a lower
> workload.
>
>  cheers
>
>
> -----------------
> Aaron Morton
> Freelance Cassandra Consultant
> New Zealand
>
> @aaronmorton
> http://www.thelastpickle.com
>
> On 3/04/2013, at 7:54 PM, Joel Samuelsson <sa...@gmail.com>
> wrote:
>
> Thanks for your suggestions. I'll get back to you with the tests you
> suggested, probably tomorrow. In the meantime though I have a few
> questions. You say:
> - 2GB of JVM heap to be insufficient to run this workload against
> Cassandra
>
> I realise that the one node cluster has a maximum workload. It did however
> work fine for a few tests and then performance deteriorated. Currently I
> can't even complete a test run since the server won't respond in time -
> even though I haven't run a test since yesterday. Shouldn't the server
> "recover" sooner or later and perform as well as it did during the first
> few tests? If not automatically, what can I do to help it? Tried nodetool
> flush but with no performance improvement.
>
> And just fyi in case it changes anything, I don't immediately read back
> the written rows. There are 100 000 rows being written and 100 000 rows
> being read in parallell. The rows being read were written to the cluster
> before the tests were run.
>
>
> 2013/4/3 Andras Szerdahelyi <an...@ignitionone.com>
>
>>  Wrt/ cfhistograms output, you are supposed to consider "Offset" x
>> "column values of each column" a separate histogram. Also AFAIK, these
>> counters are flushed after each invocation, so you are always looking at
>> data from between two invocations of cfhistograms   - With that in mind, to
>> me your cfhistograms say :
>>  - you encountered 200k rows with a single column
>>
>>  - most of your write latencies are agreeable but – and I can not
>> comment on how much a commit log write ( an append ) would affect this as I
>> have durable_writes:false on all my data - that’s a long tail you have
>> there, in to the hundreds of milliseconds which can not be OK.
>>  Depending on how often your memtables are switched ( emptied and queued
>> for flushing to disk ) and how valuable your updates received in between
>> two of these are, you may want to disable durable writes on the KS with
>> "durable_writes=false", or the very least place the commit log folder on
>> its own drive. Again, I'm not absolutely sure this could affect write
>> latency
>>
>>  - 38162520 reads served with a single sstable read, that’s great
>>
>>  - a big chunk of these reads are served from page cache or memtables (
>> the latter being more likely since, as I understand , you immediately read
>> back the written column and you work with unique row keys ) , but again
>> you have a long drop off
>>
>>  16410 mutations / sec, with 1k payload, lets say that is 20MB/s in to
>> memory with overhead, 3rd of the 2GB heap for memtables = 666MB : a switch
>> every ~30 seconds.
>> I'm not sure if your write performance can be attributed to GC only, so
>> can you sit through one of these tests with :
>>
>>  watch -n2 iostat –xmd devicename ( look for avg–qusz and what your peak
>> write throughput is )
>> watch -n2 nodetool tpstats ( see if you have flushes blocked. )
>>
>>  In the end I think you'll either find
>>  - 2GB of JVM heap to be insufficient to run this workload against
>> Cassandra
>>  - or your single disk serving your data directory being unable to keep
>> up with having to flush 20MB/s sustained write every 30 seconds ( unlikely
>> unless you are on EC2 EBS drives )
>>
>>  Also, just be sure: restart cassandra before the test and confirm your
>> benchmark application is doing what you think its doing in terms of
>> reads/writes with nodetool cfstats
>>
>>  Regards,
>>  Andras
>>
>>   From: Joel Samuelsson <sa...@gmail.com>
>> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>> Date: Wednesday 3 April 2013 11:55
>>
>> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>> Subject: Re: Cassandra freezes
>>
>>   It seems this problem is back and I am unsure how to solve it. I have
>> a test setup like this:
>> 4 machines run 8 processess each. Each process has 2 threads, 1 for
>> writing 100 000 rows and one for reading another 100 000 rows. Each machine
>> (and process) read and write the exact same rows so it is essentially the
>> same 200 000 rows being read / written.
>> The cassandra cluster is a one node cluster.
>> The first 10-20 runs of the test described above goes smoothly, after
>> that tests take increasingly long time with GC happening almost all the
>> time.
>>
>>  Here is my CASS-FREEZE-001 form answers:
>>
>>  How big is your JVM heap ?
>> 2GB
>>
>>  How many CPUs ?
>> A virtual environment so I can't be perfectly sure but according to their
>> specification, "8 cores".
>>
>> Garbage collection taking long ? ( look for log lines from GCInspector)
>> Yes, these are a few lines seen during 1 test run:
>> INFO [ScheduledTasks:1] 2013-04-03 08:47:40,757 GCInspector.java (line
>> 122) GC for ParNew: 40370 ms for 3 collections, 565045688 used; max is
>> 2038693888
>>  INFO [ScheduledTasks:1] 2013-04-03 08:48:24,720 GCInspector.java (line
>> 122) GC for ParNew: 39840 ms for 2 collections, 614065528 used; max is
>> 2038693888
>> INFO [ScheduledTasks:1] 2013-04-03 08:49:09,319 GCInspector.java (line
>> 122) GC for ParNew: 37666 ms for 2 collections, 682352952 used; max is
>> 2038693888
>>  INFO [ScheduledTasks:1] 2013-04-03 08:50:02,577 GCInspector.java (line
>> 122) GC for ParNew: 44590 ms for 1 collections, 792861352 used; max is
>> 2038693888
>>
>>
>>  Running out of heap ? ( "heap is .. full" log lines )
>> Yes. Same run as above:
>> WARN [ScheduledTasks:1] 2013-04-03 08:54:35,108 GCInspector.java (line
>> 139) Heap is 0.8596674853032178 full.  You may need to reduce memtable
>> and/or cache sizes.  Cassandra is now reducing cache sizes to free up
>> memory.  Adjust reduce_cache_sizes_at threshold in cassandra.yaml if you
>> don't want Cassandra to do this automatically
>>  WARN [ScheduledTasks:1] 2013-04-03 08:54:36,831 GCInspector.java (line
>> 145) Heap is 0.8596674853032178 full.  You may need to reduce memtable
>> and/or cache sizes.  Cassandra will now flush up to the two largest
>> memtables to free up memory.  Adjust flush_largest_memtables_at threshold
>> in cassandra.yaml if you don't want Cassandra to do this automatically
>>
>>
>> Any tasks backing up / being dropped ? ( nodetool tpstats and ".. dropped
>> in last .. ms" log lines )
>> Yes. Same run as above:
>> INFO [ScheduledTasks:1] 2013-04-03 08:52:04,943 MessagingService.java
>> (line 673) 31 MUTATION messages dropped in last 5000ms
>>  INFO [ScheduledTasks:1] 2013-04-03 08:52:04,944 MessagingService.java
>> (line 673) 8 READ messages dropped in last 5000ms
>>
>> Are writes really slow? ( nodetool cfhistograms Keyspace ColumnFamily )
>> Not sure how to interpret the output of nodetool cfhistograms, but here
>> it is (I hope it's fairly readable):
>> Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
>> 1           38162520                 0                 0                 0            200000
>> 2                  0                22                 0                 0                 0
>> 3                  0              1629                 0                 0                 0
>> 4                  0              9990                 0                 0                 0
>> 5                  0             40169                 0                 0                 0
>> 6                  0            161538                 0                 0                 0
>> 7                  0            487266                 0                 0                 0
>> 8                  0           1096601                 0                 0                 0
>> 10                 0           4842978                 0                 0                 0
>> 12                 0           7976003                 0                 0                 0
>> 14                 0           8673230                 0                 0                 0
>> 17                 0           9805730                 0                 0                 0
>> 20                 0           5083707                 0                 0                 0
>> 24                 0           2541157                 0                 0                 0
>> 29                 0            768916                 0                 0                 0
>> 35                 0            220440                 0                 0                 0
>> 42                 0            112915                 0                 0                 0
>> 50                 0             71469                 0                 0                 0
>> 60                 0             48909                 0                 0                 0
>> 72                 0             50714                 0                 0                 0
>> 86                 0             45390                 0                 0                 0
>> 103                0             41975                 0                 0                 0
>> 124                0             40371                 0                 0                 0
>> 149                0             37103                 0                 0                 0
>> 179                0             44631                 0                 0                 0
>> 215                0             43957                 0                 0                 0
>> 258                0             32499                 1                 0                 0
>> 310                0             18446          23056779                 0                 0
>> 372                0             13113          12580639                 0                 0
>> 446                0              9862           1017347                 0                 0
>> 535                0              7480            784506                 0                 0
>> 642                0              5473            274274                 0                 0
>> 770                0              4084             56379                 0                 0
>> 924                0              3046             27979                 0                 0
>> 1109               0              2205             20206            200000                 0
>> 1331               0              1658             16947                 0                 0
>> 1597               0              1228             16969                 0                 0
>> 1916               0               896             15848                 0                 0
>> 2299               0               542             13928                 0                 0
>> 2759               0               379             11782                 0                 0
>> 3311               0               326              9761                 0                 0
>> 3973               0               540              8997                 0                 0
>> 4768               0               450              7938                 0                 0
>> 5722               0               270              6552                 0                 0
>> 6866               0               170              6022                 0                 0
>> 8239               0               146              6474                 0                 0
>> 9887               0               166              7969                 0                 0
>> 11864              0               176             53725                 0                 0
>> 14237              0               203             10260                 0                 0
>> 17084              0               255              6827                 0                 0
>> 20501              0               312             27462                 0                 0
>> 24601              0               445             11523                 0                 0
>> 29521              0               736              9904                 0                 0
>> 35425              0               909             20539                 0                 0
>> 42510              0               896             14280                 0                 0
>> 51012              0               904             12443                 0                 0
>> 61214              0               715             11956                 0                 0
>> 73457              0               652             10040                 0                 0
>> 88148              0               474              7992                 0                 0
>> 105778             0               256              5043                 0                 0
>> 126934             0               113              2370                 0                 0
>> 152321             0                75              1189                 0                 0
>> 182785             0                39               690                 0                 0
>> 219342             0                44               550                 0                 0
>> 263210             0                69               551                 0                 0
>> 315852             0                35               419                 0                 0
>> 379022             0                35               564                 0                 0
>> 454826             0                52               504                 0                 0
>> 545791             0                79               749                 0                 0
>> 654949             0                61               737                 0                 0
>> 785939             0                30               399                 0                 0
>> 943127             0                57               611                 0                 0
>> 1131752            0                78               488                 0                 0
>> 1358102            0                23               302                 0                 0
>> 1629722            0                28               240                 0                 0
>> 1955666            0                48               294                 0                 0
>> 2346799            0                28               306                 0                 0
>> 2816159            0                19               224                 0                 0
>> 3379391            0                37               212                 0                 0
>> 4055269            0                24               237                 0                 0
>> 4866323            0                13               137                 0                 0
>> 5839588            0                11                99                 0                 0
>> 7007506            0                 4               115                 0                 0
>> 8409007            0                16               194                 0                 0
>> 10090808           0                12               156                 0                 0
>> 12108970           0                12                54                 0                 0
>> 14530764           0                24               147                 0                 0
>> 17436917           0                10               114                 0                 0
>> 20924300           0                 3                66                 0                 0
>> 25109160           0                22               100                 0                 0+
>>
>>  Some of the write latencies looks really bad, but since they have
>> column count 0 for most, I am not sure what to make of it.
>>
>>
>>
>> How much is lots of data?
>> Lots of data might have been an exaggeration but the test is as
>> described above. Each row read or written is about 1kb in size so each test
>> run generates 4 (machines) * 8 (processes) * 2 (read and write) * 100 000
>> (rows) * 1kb (row size) = 6250 mb read or written (half of each)
>>
>>  Wide or skinny rows?
>> Skinny rows, only a single column is used for each row.
>>
>>  Mutations/sec ?
>> The test when run on a freshly rebooted cassandra takes around 390
>> seconds, and 6400000 rows are read / written during this time period so
>> around 16410 mutations / second (unless I missunderstood the word mutation).
>>
>> Which Compaction Strategy are you using?
>> SizeTieredCompactionStrategy
>>
>>  Output of show schema (
>> cassandra-cli ) for the relevant Keyspace/CF might help as well
>>  create column family rawData
>>   with column_type = 'Standard'
>>   and comparator = 'UTF8Type'
>>   and default_validation_class = 'BytesType'
>>   and key_validation_class = 'BytesType'
>>   and read_repair_chance = 0.1
>>   and dclocal_read_repair_chance = 0.0
>>   and gc_grace = 864000
>>   and min_compaction_threshold = 4
>>   and max_compaction_threshold = 32
>>   and replicate_on_write = true
>>   and compaction_strategy =
>> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
>>   and caching = 'KEYS_ONLY'
>>   and column_metadata = [
>>     {column_name : 'created',
>>     validation_class : LongType},
>>     {column_name : 'socketHash',
>>     validation_class : Int32Type},
>>     {column_name : 'data',
>>     validation_class : UTF8Type},
>>     {column_name : 'guid',
>>     validation_class : UTF8Type},
>>     {column_name : 'evaluated',
>>     validation_class : Int32Type,
>>     index_name : 'rawData_evaluated_idx_1',
>>     index_type : 0}]
>>   and compression_options = {'sstable_compression' :
>> 'org.apache.cassandra.io.compress.SnappyCompressor'};
>>
>>  Only the "data" column is used during the test.
>>
>>
>>
>> What consistency are you doing your writes with ?
>> I am writing with consistency level ONE.
>>
>>
>> What are the values for these settings in cassandra.yaml
>>
>> memtable_total_space_in_mb: No value set in cassandra.yaml, so 1/3 of
>> heap according to documentation (2gb / 3)
>> memtable_flush_writers: No value set in cassandra.yaml, but only one data
>> directory so I assume it is 1.
>> memtable_flush_queue_size: 4
>> compaction_throughput_mb_per_sec: 16
>> concurrent_writes: 32
>>
>>
>> Which version of Cassandra?
>>  1.1.8
>>
>>
>>  Hope this helps you help me :)
>> Best regards,
>> Joel Samuelsson
>>
>>
>> 2013/3/22 Joel Samuelsson <sa...@gmail.com>
>>
>>> Thanks for the GC suggestion. It seems we didn't have enough CPU power
>>> to handle both the data and GC. Increasing the number of CPU cores made
>>> everything run smoothly at the same load.
>>>
>>>
>>> 2013/3/21 Andras Szerdahelyi <an...@ignitionone.com>
>>>
>>>>  Neat!
>>>> Thanks.
>>>>
>>>>   From: Sylvain Lebresne <sy...@datastax.com>
>>>> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>>>> Date: Thursday 21 March 2013 10:10
>>>> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>>>> Subject: Re: Cassandra freezes
>>>>
>>>>   Prior to 1.2 the index summaries were not saved on disk, and were
>>>> thus computed on startup while the sstable was loaded. In 1.2 they now are
>>>> saved on disk to make startup faster (
>>>> https://issues.apache.org/jira/browse/CASSANDRA-2392). That being
>>>> said, if the index_interval value used by a summary saved doesn't match the
>>>> current one while the sstable is loaded, the summary is recomputed anyway,
>>>> so restarting a node should always take a new index_interval setting into
>>>> account.
>>>>
>>>>  --
>>>> Sylvain
>>>>
>>>>
>>>> On Thu, Mar 21, 2013 at 9:43 AM, Andras Szerdahelyi <
>>>> andras.szerdahelyi@ignitionone.com> wrote:
>>>>
>>>>> I can not find the reference that notes having to upgradesstables when
>>>>> you
>>>>> change this. I really hope such complex assumptions are not
>>>>> formulating in
>>>>> my head just on their own and there actually exists some kind of
>>>>> reliable
>>>>> reference that clears this up :-) but,
>>>>>
>>>>> # index_interval controls the sampling of entries from the primrary
>>>>> # row index in terms of space versus time. The larger the interval,
>>>>> # the smaller and less effective the sampling will be. In technicial
>>>>> # terms, the interval coresponds to the number of index entries that
>>>>> # are skipped between taking each sample. All the sampled entries
>>>>> # must fit in memory. Generally, a value between 128 and 512 here
>>>>> # coupled with a large key cache size on CFs results in the best trade
>>>>> # offs. This value is not often changed, however if you have many
>>>>> # very small rows (many to an OS page), then increasing this will
>>>>> # often lower memory usage without a impact on performance.
>>>>>
>>>>> it is ( very ) safe to assume the row index is re-built/updated when
>>>>> new
>>>>> sstables are built.
>>>>> Obviously the sample of this index will have to follow this process
>>>>> very
>>>>> closely.
>>>>>
>>>>> It is possible however that the sample itself is not persisted and is
>>>>> built at startup as opposed to *only* when the index changes.( which is
>>>>> what I thought was happening )
>>>>> It shouldn't be too difficult to verify this, but I'd appreciate if
>>>>> someone who looked at this before could confirm if this is the case.
>>>>>
>>>>> Thanks,
>>>>> Andras
>>>>>
>>>>> On 21/03/13 09:13, "Michal Michalski" <mi...@opera.com> wrote:
>>>>>
>>>>> >About index_interval:
>>>>> >
>>>>> >> 1) you have to rebuild stables ( not an issue if you are evaluating,
>>>>> >>doing
>>>>> >> test writes.. Etc, not so much in production )
>>>>> >
>>>>> >Are you sure of this? As I understand indexes, it's not required
>>>>> because
>>>>> >this parameter defines an interval of in-memory index sample, which is
>>>>> >created during C* startup basing on a primary on-disk index file. The
>>>>> >fact that Heap usage is reduced immediately after C* restart seem to
>>>>> >confirm this, but maybe I miss something?
>>>>> >
>>>>> >M.
>>>>>
>>>>>
>>>>
>>>
>>
>
>

Re: Cassandra freezes

Posted by aaron morton <aa...@thelastpickle.com>.
> INFO [ScheduledTasks:1] 2013-04-03 08:47:40,757 GCInspector.java (line 122) GC for ParNew: 40370 ms for 3 collections, 565045688 used; max is 2038693888

This is the JVM pausing the application for 40 seconds to complete GC.

You have two choices: use a bigger heap (4GB to 8GB) or run a lower workload.
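
If you go the bigger-heap route, the usual place to set it is conf/cassandra-env.sh. A minimal sketch with illustrative values, not a tuned recommendation (the common rule of thumb is roughly 100MB of new gen per CPU core):

MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="800M"

If both are left unset, the script auto-sizes the heap from the machine's memory.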

 cheers

 
-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 3/04/2013, at 7:54 PM, Joel Samuelsson <sa...@gmail.com> wrote:

> Thanks for your suggestions. I'll get back to you with the tests you suggested, probably tomorrow. In the meantime though I have a few questions. You say:
> - 2GB of JVM heap to be insufficient to run this workload against Cassandra
> 
> I realise that the one node cluster has a maximum workload. It did however work fine for a few tests and then performance deteriorated. Currently I can't even complete a test run since the server won't respond in time - even though I haven't run a test since yesterday. Shouldn't the server "recover" sooner or later and perform as well as it did during the first few tests? If not automatically, what can I do to help it? Tried nodetool flush but with no performance improvement.
> 
> And just fyi in case it changes anything, I don't immediately read back the written rows. There are 100 000 rows being written and 100 000 rows being read in parallell. The rows being read were written to the cluster before the tests were run.
> 
> 
> 2013/4/3 Andras Szerdahelyi <an...@ignitionone.com>
> Wrt/ cfhistograms output, you are supposed to consider "Offset" x "column values of each column" a separate histogram. Also AFAIK, these counters are flushed after each invocation, so you are always looking at data from between two invocations of cfhistograms   - With that in mind, to me your cfhistograms say :
> - you encountered 200k rows with a single column
> 
> - most of your write latencies are agreeable but – and I can not comment on how much a commit log write ( an append ) would affect this as I have durable_writes:false on all my data - that’s a long tail you have there, in to the hundreds of milliseconds which can not be OK.
> Depending on how often your memtables are switched ( emptied and queued for flushing to disk ) and how valuable your updates received in between two of these are, you may want to disable durable writes on the KS with "durable_writes=false", or the very least place the commit log folder on its own drive. Again, I'm not absolutely sure this could affect write latency 
> 
> - 38162520 reads served with a single sstable read, that’s great
> 
> - a big chunk of these reads are served from page cache or memtables ( the latter being more likely since, as I understand , you immediately read back the written column and you work with unique row keys ) , but again you have a long drop off
> 
> 16410 mutations / sec, with 1k payload, lets say that is 20MB/s in to memory with overhead, 3rd of the 2GB heap for memtables = 666MB : a switch every ~30 seconds.
> I'm not sure if your write performance can be attributed to GC only, so can you sit through one of these tests with :
> 
> watch -n2 iostat -xmd devicename ( look for avgqu-sz and what your peak write throughput is )
> watch -n2 nodetool tpstats ( see if you have flushes blocked. )
> 
> In the end I think you'll either find 
> - 2GB of JVM heap to be insufficient to run this workload against Cassandra
> - or your single disk serving your data directory being unable to keep up with having to flush 20MB/s sustained write every 30 seconds ( unlikely unless you are on EC2 EBS drives )
> 
> Also, just be sure: restart cassandra before the test and confirm your benchmark application is doing what you think its doing in terms of reads/writes with nodetool cfstats
> 
> Regards,
> Andras
> 
> From: Joel Samuelsson <sa...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Date: Wednesday 3 April 2013 11:55
> 
> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Subject: Re: Cassandra freezes
> 
> It seems this problem is back and I am unsure how to solve it. I have a test setup like this:
> 4 machines run 8 processess each. Each process has 2 threads, 1 for writing 100 000 rows and one for reading another 100 000 rows. Each machine (and process) read and write the exact same rows so it is essentially the same 200 000 rows being read / written.
> The cassandra cluster is a one node cluster.
> The first 10-20 runs of the test described above goes smoothly, after that tests take increasingly long time with GC happening almost all the time.
> 
> Here is my CASS-FREEZE-001 form answers:
> 
> How big is your JVM heap ? 
> 2GB
> 
> How many CPUs ?
> A virtual environment so I can't be perfectly sure but according to their specification, "8 cores".
> 
> Garbage collection taking long ? ( look for log lines from GCInspector)
> Yes, these are a few lines seen during 1 test run:
> INFO [ScheduledTasks:1] 2013-04-03 08:47:40,757 GCInspector.java (line 122) GC for ParNew: 40370 ms for 3 collections, 565045688 used; max is 2038693888
> INFO [ScheduledTasks:1] 2013-04-03 08:48:24,720 GCInspector.java (line 122) GC for ParNew: 39840 ms for 2 collections, 614065528 used; max is 2038693888
> INFO [ScheduledTasks:1] 2013-04-03 08:49:09,319 GCInspector.java (line 122) GC for ParNew: 37666 ms for 2 collections, 682352952 used; max is 2038693888
> INFO [ScheduledTasks:1] 2013-04-03 08:50:02,577 GCInspector.java (line 122) GC for ParNew: 44590 ms for 1 collections, 792861352 used; max is 2038693888
> 
> 
> Running out of heap ? ( "heap is .. full" log lines )
> Yes. Same run as above:
> WARN [ScheduledTasks:1] 2013-04-03 08:54:35,108 GCInspector.java (line 139) Heap is 0.8596674853032178 full.  You may need to reduce memtable and/or cache sizes.  Cassandra is now reducing cache sizes to free up memory.  Adjust reduce_cache_sizes_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
> WARN [ScheduledTasks:1] 2013-04-03 08:54:36,831 GCInspector.java (line 145) Heap is 0.8596674853032178 full.  You may need to reduce memtable and/or cache sizes.  Cassandra will now flush up to the two largest memtables to free up memory.  Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
> 
> 
> Any tasks backing up / being dropped ? ( nodetool tpstats and ".. dropped
> in last .. ms" log lines )
> Yes. Same run as above:
> INFO [ScheduledTasks:1] 2013-04-03 08:52:04,943 MessagingService.java (line 673) 31 MUTATION messages dropped in last 5000ms
> INFO [ScheduledTasks:1] 2013-04-03 08:52:04,944 MessagingService.java (line 673) 8 READ messages dropped in last 5000ms
> 
> Are writes really slow? ( nodetool cfhistograms Keyspace ColumnFamily )
> Not sure how to interpret the output of nodetool cfhistograms, but here it is (I hope it's fairly readable):
> Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
> 1           38162520                 0                 0                 0            200000
> 2                  0                22                 0                 0                 0
> 3                  0              1629                 0                 0                 0
> 4                  0              9990                 0                 0                 0
> 5                  0             40169                 0                 0                 0
> 6                  0            161538                 0                 0                 0
> 7                  0            487266                 0                 0                 0
> 8                  0           1096601                 0                 0                 0
> 10                 0           4842978                 0                 0                 0
> 12                 0           7976003                 0                 0                 0
> 14                 0           8673230                 0                 0                 0
> 17                 0           9805730                 0                 0                 0
> 20                 0           5083707                 0                 0                 0
> 24                 0           2541157                 0                 0                 0
> 29                 0            768916                 0                 0                 0
> 35                 0            220440                 0                 0                 0
> 42                 0            112915                 0                 0                 0
> 50                 0             71469                 0                 0                 0
> 60                 0             48909                 0                 0                 0
> 72                 0             50714                 0                 0                 0
> 86                 0             45390                 0                 0                 0
> 103                0             41975                 0                 0                 0
> 124                0             40371                 0                 0                 0
> 149                0             37103                 0                 0                 0
> 179                0             44631                 0                 0                 0
> 215                0             43957                 0                 0                 0
> 258                0             32499                 1                 0                 0
> 310                0             18446          23056779                 0                 0
> 372                0             13113          12580639                 0                 0
> 446                0              9862           1017347                 0                 0
> 535                0              7480            784506                 0                 0
> 642                0              5473            274274                 0                 0
> 770                0              4084             56379                 0                 0
> 924                0              3046             27979                 0                 0
> 1109               0              2205             20206            200000                 0
> 1331               0              1658             16947                 0                 0
> 1597               0              1228             16969                 0                 0
> 1916               0               896             15848                 0                 0
> 2299               0               542             13928                 0                 0
> 2759               0               379             11782                 0                 0
> 3311               0               326              9761                 0                 0
> 3973               0               540              8997                 0                 0
> 4768               0               450              7938                 0                 0
> 5722               0               270              6552                 0                 0
> 6866               0               170              6022                 0                 0
> 8239               0               146              6474                 0                 0
> 9887               0               166              7969                 0                 0
> 11864              0               176             53725                 0                 0
> 14237              0               203             10260                 0                 0
> 17084              0               255              6827                 0                 0
> 20501              0               312             27462                 0                 0
> 24601              0               445             11523                 0                 0
> 29521              0               736              9904                 0                 0
> 35425              0               909             20539                 0                 0
> 42510              0               896             14280                 0                 0
> 51012              0               904             12443                 0                 0
> 61214              0               715             11956                 0                 0
> 73457              0               652             10040                 0                 0
> 88148              0               474              7992                 0                 0
> 105778             0               256              5043                 0                 0
> 126934             0               113              2370                 0                 0
> 152321             0                75              1189                 0                 0
> 182785             0                39               690                 0                 0
> 219342             0                44               550                 0                 0
> 263210             0                69               551                 0                 0
> 315852             0                35               419                 0                 0
> 379022             0                35               564                 0                 0
> 454826             0                52               504                 0                 0
> 545791             0                79               749                 0                 0
> 654949             0                61               737                 0                 0
> 785939             0                30               399                 0                 0
> 943127             0                57               611                 0                 0
> 1131752            0                78               488                 0                 0
> 1358102            0                23               302                 0                 0
> 1629722            0                28               240                 0                 0
> 1955666            0                48               294                 0                 0
> 2346799            0                28               306                 0                 0
> 2816159            0                19               224                 0                 0
> 3379391            0                37               212                 0                 0
> 4055269            0                24               237                 0                 0
> 4866323            0                13               137                 0                 0
> 5839588            0                11                99                 0                 0
> 7007506            0                 4               115                 0                 0
> 8409007            0                16               194                 0                 0
> 10090808           0                12               156                 0                 0
> 12108970           0                12                54                 0                 0
> 14530764           0                24               147                 0                 0
> 17436917           0                10               114                 0                 0
> 20924300           0                 3                66                 0                 0
> 25109160           0                22               100                 0                 0+
> 
> Some of the write latencies looks really bad, but since they have column count 0 for most, I am not sure what to make of it.
> 
> 
> 
> How much is lots of data? 
> Lots of data might have been an exaggeration but the test is as described above. Each row read or written is about 1kb in size so each test run generates 4 (machines) * 8 (processes) * 2 (read and write) * 100 000 (rows) * 1kb (row size) = 6250 mb read or written (half of each)
> 
> Wide or skinny rows? 
> Skinny rows, only a single column is used for each row.
> 
> Mutations/sec ?
> The test when run on a freshly rebooted cassandra takes around 390 seconds, and 6400000 rows are read / written during this time period so around 16410 mutations / second (unless I missunderstood the word mutation).
> 
> Which Compaction Strategy are you using? 
> SizeTieredCompactionStrategy
> 
> Output of show schema (
> cassandra-cli ) for the relevant Keyspace/CF might help as well
> create column family rawData
>   with column_type = 'Standard'
>   and comparator = 'UTF8Type'
>   and default_validation_class = 'BytesType'
>   and key_validation_class = 'BytesType'
>   and read_repair_chance = 0.1
>   and dclocal_read_repair_chance = 0.0
>   and gc_grace = 864000
>   and min_compaction_threshold = 4
>   and max_compaction_threshold = 32
>   and replicate_on_write = true
>   and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
>   and caching = 'KEYS_ONLY'
>   and column_metadata = [
>     {column_name : 'created',
>     validation_class : LongType},
>     {column_name : 'socketHash',
>     validation_class : Int32Type},
>     {column_name : 'data',
>     validation_class : UTF8Type},
>     {column_name : 'guid',
>     validation_class : UTF8Type},
>     {column_name : 'evaluated',
>     validation_class : Int32Type,
>     index_name : 'rawData_evaluated_idx_1',
>     index_type : 0}]
>   and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};
> 
> Only the "data" column is used during the test.
> 
> 
> 
> What consistency are you doing your writes with ?
> I am writing with consistency level ONE.
> 
> 
> What are the values for these settings in cassandra.yaml
> 
> memtable_total_space_in_mb: No value set in cassandra.yaml, so 1/3 of heap according to documentation (2gb / 3)
> memtable_flush_writers: No value set in cassandra.yaml, but only one data directory so I assume it is 1.
> memtable_flush_queue_size: 4
> compaction_throughput_mb_per_sec: 16
> concurrent_writes: 32
> 
> 
> Which version of Cassandra?
> 1.1.8
> 
> 
> Hope this helps you help me :)
> Best regards,
> Joel Samuelsson
> 
> 
> 2013/3/22 Joel Samuelsson <sa...@gmail.com>
> Thanks for the GC suggestion. It seems we didn't have enough CPU power to handle both the data and GC. Increasing the number of CPU cores made everything run smoothly at the same load.
> 
> 
> 2013/3/21 Andras Szerdahelyi <an...@ignitionone.com>
> Neat!
> Thanks.
> 
> From: Sylvain Lebresne <sy...@datastax.com>
> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Date: Thursday 21 March 2013 10:10
> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Subject: Re: Cassandra freezes
> 
> Prior to 1.2 the index summaries were not saved on disk, and were thus computed on startup while the sstable was loaded. In 1.2 they now are saved on disk to make startup faster (https://issues.apache.org/jira/browse/CASSANDRA-2392). That being said, if the index_interval value used by a summary saved doesn't match the current one while the sstable is loaded, the summary is recomputed anyway, so restarting a node should always take a new index_interval setting into account.
> 
> --
> Sylvain
> 
> 
> On Thu, Mar 21, 2013 at 9:43 AM, Andras Szerdahelyi <an...@ignitionone.com> wrote:
> I can not find the reference that notes having to upgradesstables when you
> change this. I really hope such complex assumptions are not formulating in
> my head just on their own and there actually exists some kind of reliable
> reference that clears this up :-) but,
> 
> # index_interval controls the sampling of entries from the primrary
> # row index in terms of space versus time. The larger the interval,
> # the smaller and less effective the sampling will be. In technicial
> # terms, the interval coresponds to the number of index entries that
> # are skipped between taking each sample. All the sampled entries
> # must fit in memory. Generally, a value between 128 and 512 here
> # coupled with a large key cache size on CFs results in the best trade
> # offs. This value is not often changed, however if you have many
> # very small rows (many to an OS page), then increasing this will
> # often lower memory usage without a impact on performance.
> 
> it is ( very ) safe to assume the row index is re-built/updated when new
> sstables are built.
> Obviously the sample of this index will have to follow this process very
> closely.
> 
> It is possible however that the sample itself is not persisted and is
> built at startup as opposed to *only* when the index changes.( which is
> what I thought was happening )
> It shouldn't be too difficult to verify this, but I'd appreciate if
> someone who looked at this before could confirm if this is the case.
> 
> Thanks,
> Andras
> 
> On 21/03/13 09:13, "Michal Michalski" <mi...@opera.com> wrote:
> 
> >About index_interval:
> >
> >> 1) you have to rebuild stables ( not an issue if you are evaluating,
> >>doing
> >> test writes.. Etc, not so much in production )
> >
> >Are you sure of this? As I understand indexes, it's not required because
> >this parameter defines an interval of in-memory index sample, which is
> >created during C* startup basing on a primary on-disk index file. The
> >fact that Heap usage is reduced immediately after C* restart seem to
> >confirm this, but maybe I miss something?
> >
> >M.
> 
> 
> 
> 
> 


Re: Cassandra freezes

Posted by Joel Samuelsson <sa...@gmail.com>.
Thanks for your suggestions. I'll get back to you with the tests you
suggested, probably tomorrow. In the meantime though I have a few
questions. You say:
- 2GB of JVM heap to be insufficient to run this workload against Cassandra

I realise that the one-node cluster has a maximum workload. It did, however,
work fine for a few tests, and then performance deteriorated. Currently I
can't even complete a test run since the server won't respond in time -
even though I haven't run a test since yesterday. Shouldn't the server
"recover" sooner or later and perform as well as it did during the first
few tests? If not automatically, what can I do to help it? I tried nodetool
flush, but saw no performance improvement.
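
Is there anything else worth capturing between runs that would show what the
node is busy with while it is supposedly idle? Something along these lines
(standard nodetool subcommands, as far as I know):

nodetool -h localhost info             ( heap used vs. max )
nodetool -h localhost compactionstats  ( compactions still pending from earlier runs? )
nodetool -h localhost tpstats          ( blocked or pending flushes / mutations )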

And just FYI, in case it changes anything: I don't immediately read back the
written rows. There are 100 000 rows being written and 100 000 rows being
read in parallel. The rows being read were written to the cluster before
the tests were run.


2013/4/3 Andras Szerdahelyi <an...@ignitionone.com>

>  Wrt/ cfhistograms output, you are supposed to consider "Offset" x
> "column values of each column" a separate histogram. Also AFAIK, these
> counters are flushed after each invocation, so you are always looking at
> data from between two invocations of cfhistograms   - With that in mind, to
> me your cfhistograms say :
>  - you encountered 200k rows with a single column
>
>  - most of your write latencies are agreeable but – and I can not comment
> on how much a commit log write ( an append ) would affect this as I have
> durable_writes:false on all my data - that’s a long tail you have there, in
> to the hundreds of milliseconds which can not be OK.
>  Depending on how often your memtables are switched ( emptied and queued
> for flushing to disk ) and how valuable your updates received in between
> two of these are, you may want to disable durable writes on the KS with
> "durable_writes=false", or the very least place the commit log folder on
> its own drive. Again, I'm not absolutely sure this could affect write
> latency
>
>  - 38162520 reads served with a single sstable read, that’s great
>
>  - a big chunk of these reads are served from page cache or memtables (
> the latter being more likely since, as I understand , you immediately read
> back the written column and you work with unique row keys ) , but again
> you have a long drop off
>
>  16410 mutations / sec, with 1k payload, lets say that is 20MB/s in to
> memory with overhead, 3rd of the 2GB heap for memtables = 666MB : a switch
> every ~30 seconds.
> I'm not sure if your write performance can be attributed to GC only, so
> can you sit through one of these tests with :
>
> watch -n2 iostat -xmd devicename ( look for avgqu-sz and what your peak
> write throughput is )
> watch -n2 nodetool tpstats ( see if you have flushes blocked. )
>
>  In the end I think you'll either find
>  - 2GB of JVM heap to be insufficient to run this workload against
> Cassandra
>  - or your single disk serving your data directory being unable to keep up
> with having to flush 20MB/s sustained write every 30 seconds ( unlikely
> unless you are on EC2 EBS drives )
>
>  Also, just be sure: restart cassandra before the test and confirm your
> benchmark application is doing what you think its doing in terms of
> reads/writes with nodetool cfstats
>
>  Regards,
>  Andras
>
>   From: Joel Samuelsson <sa...@gmail.com>
> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Date: Wednesday 3 April 2013 11:55
>
> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Subject: Re: Cassandra freezes
>
>   It seems this problem is back and I am unsure how to solve it. I have a
> test setup like this:
> 4 machines run 8 processess each. Each process has 2 threads, 1 for
> writing 100 000 rows and one for reading another 100 000 rows. Each machine
> (and process) read and write the exact same rows so it is essentially the
> same 200 000 rows being read / written.
> The cassandra cluster is a one node cluster.
> The first 10-20 runs of the test described above goes smoothly, after that
> tests take increasingly long time with GC happening almost all the time.
>
>  Here is my CASS-FREEZE-001 form answers:
>
>  How big is your JVM heap ?
> 2GB
>
>  How many CPUs ?
> A virtual environment so I can't be perfectly sure but according to their
> specification, "8 cores".
>
> Garbage collection taking long ? ( look for log lines from GCInspector)
> Yes, these are a few lines seen during 1 test run:
> INFO [ScheduledTasks:1] 2013-04-03 08:47:40,757 GCInspector.java (line
> 122) GC for ParNew: 40370 ms for 3 collections, 565045688 used; max is
> 2038693888
>  INFO [ScheduledTasks:1] 2013-04-03 08:48:24,720 GCInspector.java (line
> 122) GC for ParNew: 39840 ms for 2 collections, 614065528 used; max is
> 2038693888
> INFO [ScheduledTasks:1] 2013-04-03 08:49:09,319 GCInspector.java (line
> 122) GC for ParNew: 37666 ms for 2 collections, 682352952 used; max is
> 2038693888
>  INFO [ScheduledTasks:1] 2013-04-03 08:50:02,577 GCInspector.java (line
> 122) GC for ParNew: 44590 ms for 1 collections, 792861352 used; max is
> 2038693888
>
>
>  Running out of heap ? ( "heap is .. full" log lines )
> Yes. Same run as above:
> WARN [ScheduledTasks:1] 2013-04-03 08:54:35,108 GCInspector.java (line
> 139) Heap is 0.8596674853032178 full.  You may need to reduce memtable
> and/or cache sizes.  Cassandra is now reducing cache sizes to free up
> memory.  Adjust reduce_cache_sizes_at threshold in cassandra.yaml if you
> don't want Cassandra to do this automatically
>  WARN [ScheduledTasks:1] 2013-04-03 08:54:36,831 GCInspector.java (line
> 145) Heap is 0.8596674853032178 full.  You may need to reduce memtable
> and/or cache sizes.  Cassandra will now flush up to the two largest
> memtables to free up memory.  Adjust flush_largest_memtables_at threshold
> in cassandra.yaml if you don't want Cassandra to do this automatically
>
>
> Any tasks backing up / being dropped ? ( nodetool tpstats and ".. dropped
> in last .. ms" log lines )
> Yes. Same run as above:
> INFO [ScheduledTasks:1] 2013-04-03 08:52:04,943 MessagingService.java
> (line 673) 31 MUTATION messages dropped in last 5000ms
>  INFO [ScheduledTasks:1] 2013-04-03 08:52:04,944 MessagingService.java
> (line 673) 8 READ messages dropped in last 5000ms
>
> Are writes really slow? ( nodetool cfhistograms Keyspace ColumnFamily )
> Not sure how to interpret the output of nodetool cfhistograms, but here it
> is (I hope it's fairly readable):
> Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
> 1           38162520                 0                 0                 0            200000
> 2                  0                22                 0                 0                 0
> 3                  0              1629                 0                 0                 0
> 4                  0              9990                 0                 0                 0
> 5                  0             40169                 0                 0                 0
> 6                  0            161538                 0                 0                 0
> 7                  0            487266                 0                 0                 0
> 8                  0           1096601                 0                 0                 0
> 10                 0           4842978                 0                 0                 0
> 12                 0           7976003                 0                 0                 0
> 14                 0           8673230                 0                 0                 0
> 17                 0           9805730                 0                 0                 0
> 20                 0           5083707                 0                 0                 0
> 24                 0           2541157                 0                 0                 0
> 29                 0            768916                 0                 0                 0
> 35                 0            220440                 0                 0                 0
> 42                 0            112915                 0                 0                 0
> 50                 0             71469                 0                 0                 0
> 60                 0             48909                 0                 0                 0
> 72                 0             50714                 0                 0                 0
> 86                 0             45390                 0                 0                 0
> 103                0             41975                 0                 0                 0
> 124                0             40371                 0                 0                 0
> 149                0             37103                 0                 0                 0
> 179                0             44631                 0                 0                 0
> 215                0             43957                 0                 0                 0
> 258                0             32499                 1                 0                 0
> 310                0             18446          23056779                 0                 0
> 372                0             13113          12580639                 0                 0
> 446                0              9862           1017347                 0                 0
> 535                0              7480            784506                 0                 0
> 642                0              5473            274274                 0                 0
> 770                0              4084             56379                 0                 0
> 924                0              3046             27979                 0                 0
> 1109               0              2205             20206            200000                 0
> 1331               0              1658             16947                 0                 0
> 1597               0              1228             16969                 0                 0
> 1916               0               896             15848                 0                 0
> 2299               0               542             13928                 0                 0
> 2759               0               379             11782                 0                 0
> 3311               0               326              9761                 0                 0
> 3973               0               540              8997                 0                 0
> 4768               0               450              7938                 0                 0
> 5722               0               270              6552                 0                 0
> 6866               0               170              6022                 0                 0
> 8239               0               146              6474                 0                 0
> 9887               0               166              7969                 0                 0
> 11864              0               176             53725                 0                 0
> 14237              0               203             10260                 0                 0
> 17084              0               255              6827                 0                 0
> 20501              0               312             27462                 0                 0
> 24601              0               445             11523                 0                 0
> 29521              0               736              9904                 0                 0
> 35425              0               909             20539                 0                 0
> 42510              0               896             14280                 0                 0
> 51012              0               904             12443                 0                 0
> 61214              0               715             11956                 0                 0
> 73457              0               652             10040                 0                 0
> 88148              0               474              7992                 0                 0
> 105778             0               256              5043                 0                 0
> 126934             0               113              2370                 0                 0
> 152321             0                75              1189                 0                 0
> 182785             0                39               690                 0                 0
> 219342             0                44               550                 0                 0
> 263210             0                69               551                 0                 0
> 315852             0                35               419                 0                 0
> 379022             0                35               564                 0                 0
> 454826             0                52               504                 0                 0
> 545791             0                79               749                 0                 0
> 654949             0                61               737                 0                 0
> 785939             0                30               399                 0                 0
> 943127             0                57               611                 0                 0
> 1131752            0                78               488                 0                 0
> 1358102            0                23               302                 0                 0
> 1629722            0                28               240                 0                 0
> 1955666            0                48               294                 0                 0
> 2346799            0                28               306                 0                 0
> 2816159            0                19               224                 0                 0
> 3379391            0                37               212                 0                 0
> 4055269            0                24               237                 0                 0
> 4866323            0                13               137                 0                 0
> 5839588            0                11                99                 0                 0
> 7007506            0                 4               115                 0                 0
> 8409007            0                16               194                 0                 0
> 10090808           0                12               156                 0                 0
> 12108970           0                12                54                 0                 0
> 14530764           0                24               147                 0                 0
> 17436917           0                10               114                 0                 0
> 20924300           0                 3                66                 0                 0
> 25109160           0                22               100                 0                 0+
>
>  Some of the write latencies looks really bad, but since they have column
> count 0 for most, I am not sure what to make of it.
>
>
>
> How much is lots of data?
> Lots of data might have been an exaggeration but the test is as described
> above. Each row read or written is about 1kb in size so each test run
> generates 4 (machines) * 8 (processes) * 2 (read and write) * 100 000
> (rows) * 1kb (row size) = 6250 mb read or written (half of each)
>
>  Wide or skinny rows?
> Skinny rows, only a single column is used for each row.
>
>  Mutations/sec ?
> The test when run on a freshly rebooted cassandra takes around 390
> seconds, and 6400000 rows are read / written during this time period so
> around 16410 mutations / second (unless I missunderstood the word mutation).
>
> Which Compaction Strategy are you using?
> SizeTieredCompactionStrategy
>
>  Output of show schema (
> cassandra-cli ) for the relevant Keyspace/CF might help as well
>  create column family rawData
>   with column_type = 'Standard'
>   and comparator = 'UTF8Type'
>   and default_validation_class = 'BytesType'
>   and key_validation_class = 'BytesType'
>   and read_repair_chance = 0.1
>   and dclocal_read_repair_chance = 0.0
>   and gc_grace = 864000
>   and min_compaction_threshold = 4
>   and max_compaction_threshold = 32
>   and replicate_on_write = true
>   and compaction_strategy =
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
>   and caching = 'KEYS_ONLY'
>   and column_metadata = [
>     {column_name : 'created',
>     validation_class : LongType},
>     {column_name : 'socketHash',
>     validation_class : Int32Type},
>     {column_name : 'data',
>     validation_class : UTF8Type},
>     {column_name : 'guid',
>     validation_class : UTF8Type},
>     {column_name : 'evaluated',
>     validation_class : Int32Type,
>     index_name : 'rawData_evaluated_idx_1',
>     index_type : 0}]
>   and compression_options = {'sstable_compression' :
> 'org.apache.cassandra.io.compress.SnappyCompressor'};
>
>  Only the "data" column is used during the test.
>
>
>
> What consistency are you doing your writes with ?
> I am writing with consistency level ONE.
>
>
> What are the values for these settings in cassandra.yaml
>
> memtable_total_space_in_mb: No value set in cassandra.yaml, so 1/3 of heap
> according to documentation (2gb / 3)
> memtable_flush_writers: No value set in cassandra.yaml, but only one data
> directory so I assume it is 1.
> memtable_flush_queue_size: 4
> compaction_throughput_mb_per_sec: 16
> concurrent_writes: 32
>
>
> Which version of Cassandra?
>  1.1.8
>
>
>  Hope this helps you help me :)
> Best regards,
> Joel Samuelsson
>
>
> 2013/3/22 Joel Samuelsson <sa...@gmail.com>
>
>> Thanks for the GC suggestion. It seems we didn't have enough CPU power to
>> handle both the data and GC. Increasing the number of CPU cores made
>> everything run smoothly at the same load.
>>
>>
>> 2013/3/21 Andras Szerdahelyi <an...@ignitionone.com>
>>
>>>  Neat!
>>> Thanks.
>>>
>>>   From: Sylvain Lebresne <sy...@datastax.com>
>>> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>>> Date: Thursday 21 March 2013 10:10
>>> To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
>>> Subject: Re: Cassandra freezes
>>>
>>>   Prior to 1.2 the index summaries were not saved on disk, and were
>>> thus computed on startup while the sstable was loaded. In 1.2 they now are
>>> saved on disk to make startup faster (
>>> https://issues.apache.org/jira/browse/CASSANDRA-2392). That being said,
>>> if the index_interval value used by a summary saved doesn't match the
>>> current one while the sstable is loaded, the summary is recomputed anyway,
>>> so restarting a node should always take a new index_interval setting into
>>> account.
>>>
>>>  --
>>> Sylvain
>>>
>>>
>>> On Thu, Mar 21, 2013 at 9:43 AM, Andras Szerdahelyi <
>>> andras.szerdahelyi@ignitionone.com> wrote:
>>>
>>>> I can not find the reference that notes having to upgradesstables when
>>>> you
>>>> change this. I really hope such complex assumptions are not formulating
>>>> in
>>>> my head just on their own and there actually exists some kind of
>>>> reliable
>>>> reference that clears this up :-) but,
>>>>
>>>> # index_interval controls the sampling of entries from the primrary
>>>> # row index in terms of space versus time. The larger the interval,
>>>> # the smaller and less effective the sampling will be. In technicial
>>>> # terms, the interval coresponds to the number of index entries that
>>>> # are skipped between taking each sample. All the sampled entries
>>>> # must fit in memory. Generally, a value between 128 and 512 here
>>>> # coupled with a large key cache size on CFs results in the best trade
>>>> # offs. This value is not often changed, however if you have many
>>>> # very small rows (many to an OS page), then increasing this will
>>>> # often lower memory usage without a impact on performance.
>>>>
>>>> it is ( very ) safe to assume the row index is re-built/updated when new
>>>> sstables are built.
>>>> Obviously the sample of this index will have to follow this process very
>>>> closely.
>>>>
>>>> It is possible however that the sample itself is not persisted and is
>>>> built at startup as opposed to *only* when the index changes.( which is
>>>> what I thought was happening )
>>>> It shouldn't be too difficult to verify this, but I'd appreciate if
>>>> someone who looked at this before could confirm if this is the case.
>>>>
>>>> Thanks,
>>>> Andras
>>>>
>>>> On 21/03/13 09:13, "Michal Michalski" <mi...@opera.com> wrote:
>>>>
>>>> >About index_interval:
>>>> >
>>>> >> 1) you have to rebuild stables ( not an issue if you are evaluating,
>>>> >>doing
>>>> >> test writes.. Etc, not so much in production )
>>>> >
>>>> >Are you sure of this? As I understand indexes, it's not required
>>>> because
>>>> >this parameter defines an interval of in-memory index sample, which is
>>>> >created during C* startup basing on a primary on-disk index file. The
>>>> >fact that Heap usage is reduced immediately after C* restart seem to
>>>> >confirm this, but maybe I miss something?
>>>> >
>>>> >M.
>>>>
>>>>
>>>
>>
>

Re: Cassandra freezes

Posted by Andras Szerdahelyi <an...@ignitionone.com>.
Wrt/ the cfhistograms output: you are supposed to read each column as a separate histogram, keyed by the "Offset" column. Also, AFAIK these counters are reset after each invocation, so you are always looking at data from between two invocations of cfhistograms (a small parsing sketch follows the observations below). With that in mind, to me your cfhistograms say:
- you encountered 200k rows with a single column

- most of your write latencies are agreeable, but - and I cannot comment on how much a commit log write ( an append ) would affect this, as I have durable_writes:false on all my data - that's a long tail you have there, into the hundreds of milliseconds, which cannot be OK.
Depending on how often your memtables are switched ( emptied and queued for flushing to disk ) and how valuable the updates received in between two of these are, you may want to disable durable writes on the KS with "durable_writes=false" (a cli one-liner for this follows these observations), or at the very least place the commit log folder on its own drive. Again, I'm not absolutely sure this could affect write latency.

- 38162520 reads served with a single sstable read, that’s great

- a big chunk of these reads are served from page cache or memtables ( the latter being more likely since, as I understand it, you immediately read back the written column and you work with unique row keys ), but again you have a long drop-off
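
To put rough numbers on those tails, here is a small parsing sketch ( Python, written against the plain-text cfhistograms layout quoted in this thread; the column positions are an assumption, so adjust them if your nodetool output differs ):

import sys

def percentile(buckets, pct):
    # buckets: list of (offset, count) pairs; returns the offset of the bucket
    # in which the pct-th fraction of all samples falls
    total = sum(count for _, count in buckets)
    if total == 0:
        return None
    threshold = total * pct
    running = 0
    for offset, count in buckets:
        running += count
        if running >= threshold:
            return offset
    return buckets[-1][0]

write_lat, read_lat = [], []
for line in sys.stdin:
    fields = line.split()
    if len(fields) < 4 or not fields[0].isdigit():
        continue  # skip the header, blank lines and anything wrapped oddly
    offset = int(fields[0])
    write_lat.append((offset, int(fields[2])))  # "Write Latency" column, microseconds
    read_lat.append((offset, int(fields[3])))   # "Read Latency" column, microseconds

print("~99th percentile write latency bucket (us): %s" % percentile(write_lat, 0.99))
print("~99th percentile read latency bucket (us): %s" % percentile(read_lat, 0.99))

Usage would be something like: nodetool cfhistograms Keyspace rawData | python cfhist_pct.py

And on the durable_writes point above: it is a keyspace-level attribute, so disabling it should be a one-liner in cassandra-cli, roughly along these lines ( the keyspace name here is just a placeholder ):

update keyspace MyKeyspace with durable_writes = false;

Keep in mind that with durable writes off, anything that only exists in a memtable is lost if the node goes down before it is flushed.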

16410 mutations / sec with a 1k payload - let's say that is 20MB/s into memory with overhead; a third of the 2GB heap for memtables = ~666MB, so a memtable switch every ~30 seconds.
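
Spelled out as a quick Python back-of-the-envelope ( the 25% overhead factor is just a guess; the other numbers come from this thread ):

rows_per_sec = 6400000.0 / 390                       # ~16410 mutations/s
payload_mb_s = rows_per_sec * 1024 / (1024 * 1024)   # ~16 MB/s of raw payload
ingest_mb_s = payload_mb_s * 1.25                    # ~20 MB/s with overhead
memtable_mb = 2048 / 3.0                             # default memtable_total_space_in_mb = 1/3 of heap
print("memtable switch every ~%.0f s" % (memtable_mb / ingest_mb_s))   # ~34 s - same ballpark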
I'm not sure if your write performance can be attributed to GC only, so can you sit through one of these tests with :

watch -n2 iostat -xmd devicename ( look for avgqu-sz and what your peak write throughput is )
watch -n2 nodetool tpstats ( see if you have blocked flushes )

In the end I think you'll find either
- that 2GB of JVM heap is insufficient to run this workload against Cassandra,
- or that the single disk serving your data directory cannot keep up with flushing a sustained 20MB/s write load every ~30 seconds ( unlikely unless you are on EC2 EBS drives ).

Also, just to be sure: restart cassandra before the test and confirm with nodetool cfstats that your benchmark application is doing what you think it's doing in terms of reads/writes.

Regards,
Andras

From: Joel Samuelsson <sa...@gmail.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Wednesday 3 April 2013 11:55
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Cassandra freezes

It seems this problem is back and I am unsure how to solve it. I have a test setup like this:
4 machines run 8 processess each. Each process has 2 threads, 1 for writing 100 000 rows and one for reading another 100 000 rows. Each machine (and process) read and write the exact same rows so it is essentially the same 200 000 rows being read / written.
The cassandra cluster is a one node cluster.
The first 10-20 runs of the test described above goes smoothly, after that tests take increasingly long time with GC happening almost all the time.

Here is my CASS-FREEZE-001 form answers:

How big is your JVM heap ?
2GB

How many CPUs ?
A virtual environment so I can't be perfectly sure but according to their specification, "8 cores".

Garbage collection taking long ? ( look for log lines from GCInspector)
Yes, these are a few lines seen during 1 test run:
INFO [ScheduledTasks:1] 2013-04-03 08:47:40,757 GCInspector.java (line 122) GC for ParNew: 40370 ms for 3 collections, 565045688 used; max is 2038693888
INFO [ScheduledTasks:1] 2013-04-03 08:48:24,720 GCInspector.java (line 122) GC for ParNew: 39840 ms for 2 collections, 614065528 used; max is 2038693888
INFO [ScheduledTasks:1] 2013-04-03 08:49:09,319 GCInspector.java (line 122) GC for ParNew: 37666 ms for 2 collections, 682352952 used; max is 2038693888
INFO [ScheduledTasks:1] 2013-04-03 08:50:02,577 GCInspector.java (line 122) GC for ParNew: 44590 ms for 1 collections, 792861352 used; max is 2038693888


Running out of heap ? ( "heap is .. full" log lines )
Yes. Same run as above:
WARN [ScheduledTasks:1] 2013-04-03 08:54:35,108 GCInspector.java (line 139) Heap is 0.8596674853032178 full.  You may need to reduce memtable and/or cache sizes.  Cassandra is now reducing cache sizes to free up memory.  Adjust reduce_cache_sizes_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically
WARN [ScheduledTasks:1] 2013-04-03 08:54:36,831 GCInspector.java (line 145) Heap is 0.8596674853032178 full.  You may need to reduce memtable and/or cache sizes.  Cassandra will now flush up to the two largest memtables to free up memory.  Adjust flush_largest_memtables_at threshold in cassandra.yaml if you don't want Cassandra to do this automatically


Any tasks backing up / being dropped ? ( nodetool tpstats and ".. dropped
in last .. ms" log lines )
Yes. Same run as above:
INFO [ScheduledTasks:1] 2013-04-03 08:52:04,943 MessagingService.java (line 673) 31 MUTATION messages dropped in last 5000ms
INFO [ScheduledTasks:1] 2013-04-03 08:52:04,944 MessagingService.java (line 673) 8 READ messages dropped in last 5000ms

Are writes really slow? ( nodetool cfhistograms Keyspace ColumnFamily )
Not sure how to interpret the output of nodetool cfhistograms, but here it is (I hope it's fairly readable):
Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
1           38162520                 0                 0                 0            200000
2                  0                22                 0                 0                 0
3                  0              1629                 0                 0                 0
4                  0              9990                 0                 0                 0
5                  0             40169                 0                 0                 0
6                  0            161538                 0                 0                 0
7                  0            487266                 0                 0                 0
8                  0           1096601                 0                 0                 0
10                 0           4842978                 0                 0                 0
12                 0           7976003                 0                 0                 0
14                 0           8673230                 0                 0                 0
17                 0           9805730                 0                 0                 0
20                 0           5083707                 0                 0                 0
24                 0           2541157                 0                 0                 0
29                 0            768916                 0                 0                 0
35                 0            220440                 0                 0                 0
42                 0            112915                 0                 0                 0
50                 0             71469                 0                 0                 0
60                 0             48909                 0                 0                 0
72                 0             50714                 0                 0                 0
86                 0             45390                 0                 0                 0
103                0             41975                 0                 0                 0
124                0             40371                 0                 0                 0
149                0             37103                 0                 0                 0
179                0             44631                 0                 0                 0
215                0             43957                 0                 0                 0
258                0             32499                 1                 0                 0
310                0             18446          23056779                 0                 0
372                0             13113          12580639                 0                 0
446                0              9862           1017347                 0                 0
535                0              7480            784506                 0                 0
642                0              5473            274274                 0                 0
770                0              4084             56379                 0                 0
924                0              3046             27979                 0                 0
1109               0              2205             20206            200000                 0
1331               0              1658             16947                 0                 0
1597               0              1228             16969                 0                 0
1916               0               896             15848                 0                 0
2299               0               542             13928                 0                 0
2759               0               379             11782                 0                 0
3311               0               326              9761                 0                 0
3973               0               540              8997                 0                 0
4768               0               450              7938                 0                 0
5722               0               270              6552                 0                 0
6866               0               170              6022                 0                 0
8239               0               146              6474                 0                 0
9887               0               166              7969                 0                 0
11864              0               176             53725                 0                 0
14237              0               203             10260                 0                 0
17084              0               255              6827                 0                 0
20501              0               312             27462                 0                 0
24601              0               445             11523                 0                 0
29521              0               736              9904                 0                 0
35425              0               909             20539                 0                 0
42510              0               896             14280                 0                 0
51012              0               904             12443                 0                 0
61214              0               715             11956                 0                 0
73457              0               652             10040                 0                 0
88148              0               474              7992                 0                 0
105778             0               256              5043                 0                 0
126934             0               113              2370                 0                 0
152321             0                75              1189                 0                 0
182785             0                39               690                 0                 0
219342             0                44               550                 0                 0
263210             0                69               551                 0                 0
315852             0                35               419                 0                 0
379022             0                35               564                 0                 0
454826             0                52               504                 0                 0
545791             0                79               749                 0                 0
654949             0                61               737                 0                 0
785939             0                30               399                 0                 0
943127             0                57               611                 0                 0
1131752            0                78               488                 0                 0
1358102            0                23               302                 0                 0
1629722            0                28               240                 0                 0
1955666            0                48               294                 0                 0
2346799            0                28               306                 0                 0
2816159            0                19               224                 0                 0
3379391            0                37               212                 0                 0
4055269            0                24               237                 0                 0
4866323            0                13               137                 0                 0
5839588            0                11                99                 0                 0
7007506            0                 4               115                 0                 0
8409007            0                16               194                 0                 0
10090808           0                12               156                 0                 0
12108970           0                12                54                 0                 0
14530764           0                24               147                 0                 0
17436917           0                10               114                 0                 0
20924300           0                 3                66                 0                 0
25109160           0                22               100                 0                 0

Some of the write latencies look really bad, but since the column count is 0 for most of those rows I am not sure what to make of it.
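If I understand the output correctly, the Offset column is a bucket boundary (microseconds for the latency columns) and the other columns are counts of operations per bucket, so rough percentiles can be pulled out with a small script like the sketch below. It is only fed a hand-copied subset of the Read Latency buckets from the table above, so the numbers are approximate:

def percentile(buckets, p):
    # buckets: list of (offset, count) pairs with offsets ascending.
    # Returns the offset of the bucket containing the p-th percentile sample.
    total = sum(count for _, count in buckets)
    target = total * p / 100.0
    seen = 0
    for offset, count in buckets:
        seen += count
        if seen >= target:
            return offset
    return buckets[-1][0]

# Hand-copied subset of the Read Latency buckets from the table above
# (offset in microseconds, count of reads falling in that bucket).
read_latency = [
    (310, 23056779), (372, 12580639), (446, 1017347),
    (535, 784506), (642, 274274), (770, 56379), (924, 27979),
]

for p in (50, 95, 99):
    print("~p%d read latency: <= %d us" % (p, percentile(read_latency, p)))

Feeding the full table in would give more honest tail percentiles, since most of the really slow reads sit in the buckets left out of this subset.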



How much is lots of data?
Lots of data might have been an exaggeration, but the test is as described above. Each row read or written is about 1 KB in size, so each test run generates 4 (machines) * 8 (processes) * 2 (read and write) * 100 000 (rows) * 1 KB (row size) = 6 400 000 KB, or roughly 6250 MB, read or written (half of each).

Wide or skinny rows?
Skinny rows, only a single column is used for each row.

Mutations/sec ?
The test, when run against a freshly rebooted Cassandra, takes around 390 seconds, and 6 400 000 rows are read / written during this period, so around 16 410 mutations / second (unless I misunderstood the word mutation).
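As a quick back-of-envelope check of those figures (all numbers taken straight from this message, nothing measured independently):

machines, processes, threads = 4, 8, 2    # 2 threads per process: 1 writer, 1 reader
rows_per_thread = 100000
row_size_kb = 1

total_rows = machines * processes * threads * rows_per_thread   # 6 400 000
total_mb = total_rows * row_size_kb / 1024.0                     # ~6250 MB
run_seconds = 390

ops_per_sec = total_rows / float(run_seconds)                    # ~16 410 ops/s
writes_per_sec = ops_per_sec / 2   # strictly, only the writes are "mutations"

print("rows: %d  volume: ~%.0f MB  ops/s: ~%.0f  writes/s: ~%.0f"
      % (total_rows, total_mb, ops_per_sec, writes_per_sec))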

Which Compaction Strategy are you using?
SizeTieredCompactionStrategy

Output of show schema (cassandra-cli) for the relevant Keyspace/CF might help as well:
create column family rawData
  with column_type = 'Standard'
  and comparator = 'UTF8Type'
  and default_validation_class = 'BytesType'
  and key_validation_class = 'BytesType'
  and read_repair_chance = 0.1
  and dclocal_read_repair_chance = 0.0
  and gc_grace = 864000
  and min_compaction_threshold = 4
  and max_compaction_threshold = 32
  and replicate_on_write = true
  and compaction_strategy = 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'
  and caching = 'KEYS_ONLY'
  and column_metadata = [
    {column_name : 'created',
    validation_class : LongType},
    {column_name : 'socketHash',
    validation_class : Int32Type},
    {column_name : 'data',
    validation_class : UTF8Type},
    {column_name : 'guid',
    validation_class : UTF8Type},
    {column_name : 'evaluated',
    validation_class : Int32Type,
    index_name : 'rawData_evaluated_idx_1',
    index_type : 0}]
  and compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'};

Only the "data" column is used during the test.



What consistency are you doing your writes with ?
I am writing with consistency level ONE.


What are the values for these settings in cassandra.yaml

memtable_total_space_in_mb: No value set in cassandra.yaml, so 1/3 of the heap according to the documentation (2 GB / 3, roughly 683 MB)
memtable_flush_writers: No value set in cassandra.yaml, but there is only one data directory, so I assume it is 1.
memtable_flush_queue_size: 4
compaction_throughput_mb_per_sec: 16
concurrent_writes: 32
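Written out explicitly, the relevant part of cassandra.yaml would look roughly like this (1.1.x setting names; the two threshold values at the end are, as far as I know, the shipped defaults -- they are not set explicitly in my file either):

# settings as reported above
memtable_total_space_in_mb: 683        # ~1/3 of the 2 GB heap when left unset
memtable_flush_writers: 1              # defaults to the number of data directories
memtable_flush_queue_size: 4
compaction_throughput_mb_per_sec: 16
concurrent_writes: 32

# emergency thresholds referenced by the GCInspector warnings earlier
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85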


Which version of Cassandra?
1.1.8


Hope this helps you help me :)
Best regards,
Joel Samuelsson


2013/3/22 Joel Samuelsson <sa...@gmail.com>
Thanks for the GC suggestion. It seems we didn't have enough CPU power to handle both the data and GC. Increasing the number of CPU cores made everything run smoothly at the same load.


2013/3/21 Andras Szerdahelyi <an...@ignitionone.com>
Neat!
Thanks.

From: Sylvain Lebresne <sy...@datastax.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Thursday 21 March 2013 10:10
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Cassandra freezes

Prior to 1.2 the index summaries were not saved on disk, and were thus computed on startup while the sstable was loaded. In 1.2 they now are saved on disk to make startup faster (https://issues.apache.org/jira/browse/CASSANDRA-2392). That being said, if the index_interval value used by a summary saved doesn't match the current one while the sstable is loaded, the summary is recomputed anyway, so restarting a node should always take a new index_interval setting into account.

--
Sylvain


On Thu, Mar 21, 2013 at 9:43 AM, Andras Szerdahelyi <an...@ignitionone.com> wrote:
I can not find the reference that notes having to run upgradesstables when you
change this. I really hope such complex assumptions are not forming in
my head on their own and that there actually exists some kind of reliable
reference that clears this up :-) but,

# index_interval controls the sampling of entries from the primary
# row index in terms of space versus time. The larger the interval,
# the smaller and less effective the sampling will be. In technical
# terms, the interval corresponds to the number of index entries that
# are skipped between taking each sample. All the sampled entries
# must fit in memory. Generally, a value between 128 and 512 here
# coupled with a large key cache size on CFs results in the best trade
# offs. This value is not often changed, however if you have many
# very small rows (many to an OS page), then increasing this will
# often lower memory usage without an impact on performance.

it is (very) safe to assume the row index is re-built/updated when new
sstables are built.
Obviously the sample of this index will have to follow this process very
closely.

It is possible, however, that the sample itself is not persisted and is
built at startup, as opposed to *only* when the index changes (which is
what I thought was happening).
It shouldn't be too difficult to verify this, but I'd appreciate it if
someone who has looked at this before could confirm whether this is the case.

Thanks,
Andras

On 21/03/13 09:13, "Michal Michalski" <mi...@opera.com> wrote:

>About index_interval:
>
>> 1) you have to rebuild stables ( not an issue if you are evaluating,
>>doing
>> test writes.. Etc, not so much in production )
>
>Are you sure of this? As I understand indexes, it's not required, because
>this parameter defines the interval of the in-memory index sample, which is
>created during C* startup based on the primary on-disk index file. The
>fact that heap usage is reduced immediately after a C* restart seems to
>confirm this, but maybe I'm missing something?
>
>M.