Posted to user@cassandra.apache.org by David Hawthorne <dh...@gmx.3crowd.com> on 2011/07/12 08:57:15 UTC

ReplicateOnWrite issues

This is now my fourth attempt to get the message through.  Apologies if you see multiple copies.

I've tried to give as much relevant data as I can think of, but please let me know if you need any other info.  I spent the day getting jmxtrans to talk to statsd with the cassandra JMX data, so I can provide graphs of just about anything you want.  I can also re-run the tests to replicate the problem if need be.  As always, thanks for taking a look!

I have a 4-host test cluster running 0.8.1 that I'm writing counters into, using hector with CL.ONE or CL.QUORUM (the test was run with each, and both gave the same results).  The partitioner is RandomPartitioner, and RF=3.  nodetool ring shows that the actual data is well balanced:

Address         DC          Rack        Status State   Load            Owns    Token                                       
                                                                              127605887595351923798765477786913079296     
10.0.0.57    datacenter1 rack1       Up     Normal  1.59 GB         25.00%  0                                           
10.0.0.56    datacenter1 rack1       Up     Normal  1.39 GB         25.00%  42535295865117307932921825928971026432      
10.0.0.55    datacenter1 rack1       Up     Normal  1.24 GB         25.00%  85070591730234615865843651857942052864      
10.0.0.54    datacenter1 rack1       Up     Normal  1.34 GB         25.00%  127605887595351923798765477786913079296     

I also tested against a 1-box test cluster with RF=1.

Both show the same results:  high performance for a while, and then ReplicateOnWrite pending tasks back way up (I've seen the count as high as a million), the cassandra process becomes unresponsive, and the hector client starts throwing exceptions.  High performance in this case means about 5-20k inserts/sec on both the 4-box and the 1-box cluster (I'm inserting one long row at a time, so performance is bounded by the one box owning the row).  The cassandra process never seems to recover, even when left for 24 hours; it's still chewing through those ReplicateOnWrite pending tasks.
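
For concreteness, each "insert" above is really a counter increment issued through hector, roughly like the sketch below (names are placeholders rather than my real schema, and I'm writing the hector 0.8 counter calls from memory, so treat it as illustrative):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class CounterIncrementSketch {
    public static void main(String[] args) {
        // Placeholder cluster/keyspace/CF names, not my real schema.
        Cluster cluster = HFactory.getOrCreateCluster("TestCluster", "10.0.0.54:9160");
        Keyspace ks = HFactory.createKeyspace("CounterKS", cluster);
        Mutator<String> mutator = HFactory.createMutator(ks, StringSerializer.get());

        // Each "insert" in this thread is really one of these counter increments.
        mutator.insertCounter("row-key", "Counters",
                HFactory.createCounterColumn("some-counter", 1L, StringSerializer.get()));
        mutator.execute();
    }
}

The real workload hammers one long row at a time with these increments before moving on to the next row.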

GC doesn't seem to be a factor.  Logs show GCs completing in < 1s on all 4 boxes in the test cluster, as well as on the 1 standalone server.

I'm also seeing highly erratic performance from the cluster in general:  inserts/sec usually start at around 3k, but then maybe creep up towards 6k, then drop back down to below 1k for an extended period of time, then maybe 1% of the time they'll spike up to 8k or even 15k, but that doesn't last more than 30 seconds.

Here's the tpstats output from all 4 boxes:

Slowest/most problematic/highest cpu usage one:

This tpstats output is also representative of the one-box cluster about 20 minutes into my insert run.

4 $ ./nodetool -h localhost tpstats
Pool Name                    Active   Pending      Completed
ReadStage                         0         0              0
RequestResponseStage              0         0       59095840
MutationStage                     0         0       31182942
ReadRepairStage                   0         0              0
ReplicateOnWriteStage            32    928042       28246716
GossipStage                       1         1         449464
AntiEntropyStage                  0         0              0
MigrationStage                    0         0              4
MemtablePostFlusher               0         0            445
StreamStage                       0         0              0
FlushWriter                       0         0            445
FILEUTILS-DELETE-POOL             0         0             38
MiscStage                         0         0              0
FlushSorter                       0         0              0
InternalResponseStage             0         0              4
HintedHandoff                     0         0              0

Second highest cpu usage one:

3 $ ./nodetool -h localhost tpstats
Pool Name                    Active   Pending      Completed
ReadStage                         0         0              0
RequestResponseStage              0         0       27413910
MutationStage                     0         0       27523094
ReadRepairStage                   0         0              0
ReplicateOnWriteStage             0         0       11111971
GossipStage                       0         0         335422
AntiEntropyStage                  0         0              0
MigrationStage                    0         0              4
MemtablePostFlusher               0         0             66
StreamStage                       0         0              0
FlushWriter                       0         0             66
FILEUTILS-DELETE-POOL             0         0             31
MiscStage                         0         0              0
FlushSorter                       0         0              0
InternalResponseStage             0         0              4
HintedHandoff                     0         0              0

The other 2 that see minimal cpu usage:

2 $ ./nodetool -h localhost tpstats
Pool Name                    Active   Pending      Completed
ReadStage                         0         0              0
RequestResponseStage              0         0        9213999
MutationStage                     0         0       33233114
ReadRepairStage                   0         0              0
ReplicateOnWriteStage             0         0              0
GossipStage                       0         0         336016
AntiEntropyStage                  0         0              0
MigrationStage                    0         0              4
MemtablePostFlusher               0         0             77
StreamStage                       0         0              0
FlushWriter                       0         0             77
FILEUTILS-DELETE-POOL             0         0             40
MiscStage                         0         0              0
FlushSorter                       0         0              0
InternalResponseStage             0         0              4
HintedHandoff                     0         0              0


1 $ ./nodetool -h localhost tpstats
Pool Name                    Active   Pending      Completed
ReadStage                         0         0              0
RequestResponseStage              0         0       13796875
MutationStage                     0         0       36116662
ReadRepairStage                   0         0              0
ReplicateOnWriteStage             0         0              0
GossipStage                       0         0         336496
AntiEntropyStage                  0         0              0
MigrationStage                    0         0              4
MemtablePostFlusher               0         0             91
StreamStage                       0         0              0
FlushWriter                       0         0             91
FILEUTILS-DELETE-POOL             0         0             43
MiscStage                         0         0              0
FlushSorter                       0         0              0
InternalResponseStage             0         0              4
HintedHandoff                     0         0              0

Re: ReplicateOnWrite issues

Posted by Sylvain Lebresne <sy...@datastax.com>.
On Wed, Jul 13, 2011 at 1:00 AM, David Hawthorne <dh...@gmx.3crowd.com> wrote:
> Thanks for looking at that.
>
> Our use case involves supercolumns that have 2-20,000 counters within them.  For a set of continuous updates to one supercolumn, the behavior you're describing is:

Here's your problem. Don't do that. It's not a counter issue, it's a
super column limitation. You should only use super columns when you put
a small number of subcolumns into them (how many depends on the actual
size of the columns, but taking counters as the example, I'd say that
if you have super columns with more than 100 counters, you should
probably change your model to use composite columns instead of super
columns).
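
To make that concrete, here is a rough sketch of the composite-style model (illustrative only: the old (super column, subcolumn) pair is flattened into a single column name in a regular counter CF; a delimited String stands in here for a proper CompositeType-encoded name, and the CF and key names are made up):

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class CompositeCounterSketch {
    // Increment one counter in a regular (non-super) counter CF whose column
    // names carry both parts of the old (super column, subcolumn) pair.
    static void increment(Keyspace ks, String rowKey,
                          String superPart, String subPart, long delta) {
        // A delimited String stands in for a CompositeType-encoded name.
        String flattenedName = superPart + ":" + subPart;
        Mutator<String> m = HFactory.createMutator(ks, StringSerializer.get());
        m.insertCounter(rowKey, "FlatCounters",
                HFactory.createCounterColumn(flattenedName, delta, StringSerializer.get()));
        m.execute();
    }
}

With that layout, incrementing one counter no longer drags thousands of sibling subcolumns into the replicate-on-write read, and a range slice on the first part of the name can still pull back everything that used to live under one super column.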

--
Sylvain


Re: ReplicateOnWrite issues

Posted by David Hawthorne <dh...@gmx.3crowd.com>.
Thanks for looking at that.

Our use case involves supercolumns that have 2-20,000 counters within them.  For a set of continuous updates to one supercolumn, the behavior you're describing is:

insert first counter into supercolumn
insert second counter into supercolumn
read entire supercolumn (now 2 wide)
insert third counter into supercolumn
read entire supercolumn (now 3 wide)
insert fourth counter into supercolumn
read entire supercolumn (now 4 wide)
...
insert 20,000th counter into supercolumn
read entire supercolumn for the 20,000th time (now 20,000 columns wide)

What happens if I turn replicate on write off and go to RF=3 on a multi-node cluster?

Unfortunately, I don't see a way to get the size of the largest supercolumn in the same way you can get the size of the largest row, so I don't know what our max number of columns in any supercolumn is.

The test I was running against the single-node cluster just died; here's a graph.  It held steady at 2.5-3k inserts/sec for a while, and then cassandra became unresponsive to JMX requests for a while (that's the sharp dip to 0 at 15:48), after which you can see the ReplicateOnWrite pending tasks creep upwards of 1M as the max row size and the max size of any CF on disk both spike.  You can also see the total number of reads done increase sharply at the same time.  All stats are absolute values except inserts/sec, which was multiplied by 10 so it would show up with everything else.  Inserts/sec are from the client's perspective, not cassandra's.  I can also tell you that the client is seeing a lot of hector timeout exceptions, and the retry burden has been pushed back to the client.


Re: ReplicateOnWrite issues

Posted by Sylvain Lebresne <sy...@datastax.com>.
On Wed, Jul 13, 2011 at 12:18 AM, David Hawthorne <dh...@gmx.3crowd.com> wrote:
>
> On Jul 12, 2011, at 3:02 PM, Sylvain Lebresne wrote:
>
>> On Tue, Jul 12, 2011 at 11:42 PM, David Hawthorne <dh...@gmx.3crowd.com> wrote:
>>> Well, I was using a large number of clients:  I tried configuring a hector pool of 20-200 to see what effect that had on throughput. There's definitely a point after which there's no gain, so I dialed it back down.  To clarify a few other things, when I say inserts I mean increments, as this test only involves counters.  When I say a row is full, I mean I've written all of the data into it that belongs in it from our application's perspective.  Not all rows are the same length, and the average number of subcolumns per supercolumn is higher for some rows than for others.
>>>
>>
>> Oh yes, you're using supercolumns. That could help magnify the problem
>> a little bit, given that to read any subcolumn of a supercolumn, we
>> have to read the whole supercolumn. Depends how many counters per
>> supercolumns we're talking about.
>
> Does this imply that incrementing 50 counters in one supercolumn in one row in one CF requires 50! reads of the supercolumn?  One read of the entire supercolumn for each insert?

Every increment of one of the 50 counters means reading the whole
super column, i.e., all 50 counters. That is, there are not more reads,
but the reads themselves have to read more and so can take longer.
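
Put differently, filling one super column with N counters one increment at a time makes those replication reads scan roughly 1 + 2 + ... + N, i.e. about N^2/2 subcolumns in total. A quick back-of-the-envelope sketch (using N = 20,000, the upper end of the range mentioned earlier in the thread):

public class SuperColumnReadCost {
    public static void main(String[] args) {
        // Counters written into one super column, one increment at a time.
        long n = 20000;
        // Each increment re-reads the whole super column, so the i-th
        // increment reads about i subcolumns: 1 + 2 + ... + n in total.
        long totalSubcolumnsRead = n * (n + 1) / 2;
        System.out.println(totalSubcolumnsRead); // 200010000, i.e. ~200 million
    }
}

That's on the order of 200 million subcolumn reads for a single row, which is consistent with the ReplicateOnWrite stage falling behind.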


Re: ReplicateOnWrite issues

Posted by David Hawthorne <dh...@gmx.3crowd.com>.
On Jul 12, 2011, at 3:02 PM, Sylvain Lebresne wrote:

> Oh yes, you're using supercolumns. That could help magnify the problem
> a little bit, given that to read any subcolumn of a supercolumn, we
> have to read the whole supercolumn. Depends how many counters per
> supercolumns we're talking about.

Does this imply that incrementing 50 counters in one supercolumn in one row in one CF requires 50! reads of the supercolumn?  One read of the entire supercolumn for each insert? 




Re: ReplicateOnWrite issues

Posted by Sylvain Lebresne <sy...@datastax.com>.
On Tue, Jul 12, 2011 at 11:42 PM, David Hawthorne <dh...@gmx.3crowd.com> wrote:
> Well, I was using a large number of clients:  I tried configuring a hector pool of 20-200 to see what effect that had on throughput. There's definitely a point after which there's no gain, so I dialed it back down.  To clarify a few other things, when I say inserts I mean increments, as this test only involves counters.  When I say a row is full, I mean I've written all of the data into it that belongs in it from our application's perspective.  Not all rows are the same length, and the average number of subcolumns per supercolumn is higher for some rows than for others.
>

Oh yes, you're using supercolumns. That could help magnify the problem
a little bit, given that to read any subcolumn of a supercolumn, we
have to read the whole supercolumn. It depends on how many counters per
supercolumn we're talking about.

> I'm curious, is it necessary for a read to happen for replication when using RF=1?

No, it's not necessary with RF=1 ... but we still do it :(. It's
basically a bug; we'll fix that in 0.8.2.

> I am currently running a test against a single node cluster, RF=1, CL.ONE, 4 connections/concurrent increments, and graphing various data points.  If something interesting happens, I'll send a graph.
>
> When you say stagger writes, do you mean just lower the overall number of increments/sec my client is attempting,

Yes, this is what I mean.

> or do you mean I should stagger the load amongst many CFs/rows at once?  Right now I'm writing heavily into a small set of rows, each going into 3 CFs, before moving on to the next small set of rows.
>
>

Re: ReplicateOnWrite issues

Posted by David Hawthorne <dh...@gmx.3crowd.com>.
Well, I was using a large number of clients:  I tried configuring a hector pool of 20-200 to see what effect that had on throughput.  There's definitely a point after which there's no gain, so I dialed it back down.  To clarify a few other things, when I say inserts I mean increments, as this test only involves counters.  When I say a row is full, I mean I've written all of the data into it that belongs in it from our application's perspective.  Not all rows are the same length, and the average number of subcolumns per supercolumn is higher for some rows than for others.

I'm curious, is it necessary for a read to happen for replication when using RF=1?

I am currently running a test against a single node cluster, RF=1, CL.ONE, 4 connections/concurrent increments, and graphing various data points.  If something interesting happens, I'll send a graph.

When you say stagger writes, do you mean just lower the overall number of increments/sec my client is attempting, or do you mean I should stagger the load amongst many CFs/rows at once?  Right now I'm writing heavily into a small set of rows, each going into 3 CFs, before moving on to the next small set of rows.
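
If it's the first interpretation, a crude client-side throttle is roughly what I'd try, something like the sketch below (the class name and target rate are made up for illustration; it just caps increments/sec with sleeps):

public class IncrementThrottle {
    private final long minNanosPerOp;
    private long nextAllowed = System.nanoTime();

    public IncrementThrottle(int maxIncrementsPerSec) {
        this.minNanosPerOp = 1000000000L / maxIncrementsPerSec;
    }

    // Block until the next increment is allowed under the configured rate.
    public synchronized void acquire() throws InterruptedException {
        long now = System.nanoTime();
        if (nextAllowed > now) {
            long waitNanos = nextAllowed - now;
            Thread.sleep(waitNanos / 1000000L, (int) (waitNanos % 1000000L));
        }
        nextAllowed = Math.max(nextAllowed, now) + minNanosPerOp;
    }
}

Each worker thread would call acquire() before issuing an increment, and I'd creep the rate up until ReplicateOnWrite pending in tpstats stays near zero.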


On Jul 12, 2011, at 12:26 PM, Sylvain Lebresne wrote:

> When you do a counter increment at CL.ONE, the write is acknowledged as
> soon as the first replica getting the write has pushed the
> increment into its memtable. However, there is a read happening for
> the replication to the other replicas (this is necessary given the
> counter design). What is happening is that our nodes are not able to
> do those reads fast enough. Or, another way to see it is that you are
> writing faster than your cluster can handle.
> 
> I've opened https://issues.apache.org/jira/browse/CASSANDRA-2889 to do
> something about that, but for now you'll have to stagger your writes.
> I'm more surprised that tasks on the replicate on write stage would
> stack up at QUORUM, unless you're using a very large number of
> clients.
> 
> --
> Sylvain
> 


Re: ReplicateOnWrite issues

Posted by Sylvain Lebresne <sy...@datastax.com>.
When you do a counter increment at CL.ONE, the write is acknowledged
as soon as the first replica to receive the write has pushed the
increment into its memtable. However, a read does happen for the
replication to the other replicas (this is necessary for the counter
design). What is happening is that your nodes are not able to do
those reads fast enough; put another way, you are writing faster
than your cluster can handle.

I've opened https://issues.apache.org/jira/browse/CASSANDRA-2889 to do
something about that, but for now you'll have to stagger your writes.
I'm more surprised that tasks on the replicate on write stage would
stack up at QUORUM, unless you're using a very large number of
clients.
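
To make that concrete, here is a toy model (plain Java, not the actual Cassandra code, all names made up) of why the pending count can grow without bound at CL.ONE: the acknowledgement never waits for the replication read, so if those reads are slower than the incoming increments, tasks pile up on the stage:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

// Toy model only. It shows the shape of the problem: the client ack happens
// before the replication read, so a slow read backs up the queue.
public class ReplicateOnWriteModel {
    private static final ExecutorService replicateOnWriteStage = Executors.newFixedThreadPool(4);
    private static final AtomicLong pending = new AtomicLong();

    static void increment(final String row, final String col) {
        applyToLocalMemtable(row, col);             // cheap; at CL.ONE the write is acked here
        pending.incrementAndGet();
        replicateOnWriteStage.submit(new Runnable() {
            public void run() {
                readCurrentCounterValue(row, col);  // the expensive part, especially on long rows
                forwardToOtherReplicas(row, col);
                pending.decrementAndGet();
            }
        });
    }

    static void applyToLocalMemtable(String row, String col) { /* in-memory update */ }
    static void forwardToOtherReplicas(String row, String col) { /* fire-and-forget send */ }

    static void readCurrentCounterValue(String row, String col) {
        try { Thread.sleep(1); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 20000; i++) {
            increment("row", "col" + i);            // an unthrottled "client"
        }
        System.out.println("pending replicate-on-write tasks: " + pending.get());
        replicateOnWriteStage.shutdown();
    }
}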

--
Sylvain


Re: ReplicateOnWrite issues

Posted by Yang <te...@gmail.com>.
What do you mean by "until they are full"?

Right now I guess a quick black-box test for this problem is to try
inserting only shorter rows, and see if the problem persists.

As you said, it could be that addReadCommandFromColumnFamily is taking
a long time to read; if that read hits disk, it's going to be slow. Can
you try giving the node plenty of memory and testing with just those few
rows of counters, so that they stay in memory? If it's still slow, you
may want to try CASSANDRA-2843 (for long row reading).
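
One cheap way to get shorter rows just for that test (a sketch, with made-up names) is to bucket the row key, so the same increments spread across many smaller physical rows:

// Sketch: split one logical row into N smaller physical rows by bucketing the key.
public final class RowBucketing {
    private RowBucketing() {}

    public static String bucketedRowKey(String logicalRow, String subcolumnName, int buckets) {
        int bucket = Math.abs(subcolumnName.hashCode() % buckets); // same subcolumn always lands in the same bucket
        return logicalRow + ":" + bucket;                          // e.g. "stats-20110712:17"
    }
}

Reads of the logical total would then have to sum over the bucket rows, but for this test you only care whether ReplicateOnWrite still backs up.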



On Tue, Jul 12, 2011 at 12:37 AM, David Hawthorne <dh...@gmx.3crowd.com> wrote:
> It's definitely for counters, and some of the rows I'm inserting are long-ish, if 1.3MB is long.
>
> Maybe it would help if I said I was using counter super columns.  I'm also writing to only a handful of rows at a time, until they are full.  It looks like the counter super column code in addReadCommandFromColumnFamily, in the ReplicateOnWrite handling, might not be optimized for this kind of load.
>
> It looks like I may be able to turn off replicate_on_write for the CFs as long as I use CL.ONE to get around this in the short term, but I'd like to know what happens when replicate on write is true and when it is false so I can make a more informed choice.
>
>
>

Re: ReplicateOnWrite issues

Posted by David Hawthorne <dh...@gmx.3crowd.com>.
It's definitely for counters, and some of the rows I'm inserting are long-ish, if 1.3MB is long.

Maybe it would help if I said I was using counter super columns.  I'm also writing to only a handful of rows at a time, until they are full.  It looks like the counter super column code in addReadCommandFromColumnFamily, in the ReplicateOnWrite handling, might not be optimized for this kind of load.

It looks like I may be able to turn off replicate_on_write for the CFs as long as I use CL.ONE to get around this in the short term, but I'd like to know what happens when replicate on write is true and when it is false so I can make a more informed choice.

On Jul 12, 2011, at 12:20 AM, Yang wrote:

> Interesting.
> 
> First, just to make sure: since replicateOnWrite is for counters, you
> are using counters (you use the word "insert" instead of
> "add/increment"), right?
> 
> If you are using counters, the leader is supposed to run
> replicateOnWrite; somehow all of your adds are finding that one box as
> leader, and that's probably something worth investigating.
> 
> Finally, you could try debugging into those replicateOnWrite tasks to
> see what exactly they are doing; maybe something is immediately wrong.
> 
> 
> 


Re: ReplicateOnWrite issues

Posted by Yang <te...@gmail.com>.
Interesting.

First, just to make sure: since replicateOnWrite is for counters, you
are using counters (you use the word "insert" instead of
"add/increment"), right?

If you are using counters, the leader is supposed to run
replicateOnWrite; somehow all of your adds are finding that one box as
leader, and that's probably something worth investigating.

Finally, you could try debugging into those replicateOnWrite tasks to
see what exactly they are doing; maybe something is immediately wrong.


