Posted to user@cassandra.apache.org by Mike Gallamore <mi...@googlemail.com> on 2010/04/07 22:19:26 UTC

writes to Cassandra failing occasionally

I have writes to cassandra that are failing, or at least a read shortly 
after a write is still getting an old value. I realize Cassandra is 
"eventually consistent" but this system is a single CPU single node with 
consistency level set to 1, so this seems odd to me.

My setup:

Cassandra 0.6rc1

client: writing using Net::Cassandra::Easy 0.09 (perl thrift interface)

Cluster: single node. Default out-of-the-box setup. Single CPU system 
(it's a P4 dinosaur). The client library defaults to read and write 
consistency of 1, which equals the number of nodes, so it is effectively 
"all" in my case.

I have a test that modifies data and then checks that it has the 
expected value. These tests sometimes pass and sometimes fail part way 
through the test suite. The failures don't seem to be predictable: 
stopping Cassandra, nuking the data directory, and restarting doesn't 
make the system any more likely to pass, and whether one attempt passes 
or fails doesn't correlate with the outcome of the next. What I'm seeing 
is old data still being returned to get requests after the mutate has run.

For example: I modified the "set" method of my program to output the 
value being saved, the old value and the new value and I see things like 
this when the tests fail:

Expected value was
55, but got
72 instead, previous value was
72 at lib/ET/Storage/Cassandra.pm line 96.

This means the previous value was 72; the modifications that were done 
should have resulted in the score being 55, but I got 72 instead. Any 
ideas what could cause this?
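
For context, a minimal sketch of the write-then-read check being described; set() and get() here are hypothetical placeholders standing in for the real ET::Storage::Cassandra methods, not their actual signatures:

use strict;
use warnings;
use Carp qw(carp);

# Placeholder round-trip check: write a value, read it straight back,
# and warn (as in the output above) when the stored value is stale.
sub check_roundtrip {
    my ($store, $key, $expected) = @_;
    my $previous = $store->get($key);
    $store->set($key, $expected);
    my $got = $store->get($key);
    carp "Expected value was $expected, but got $got instead, "
       . "previous value was $previous"
        unless $got == $expected;
    return $got;
}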

As an aside, I modified some other code to use Net::Cassandra instead of 
Net::Cassandra::Easy and noticed that it seems to run 3-4x slower. Neither 
is stunningly fast. The test clients are running on the same machine 
as Cassandra, and I'm only getting somewhere between 100-400 transactions 
per second (huge variance) with N::C::Easy and 30-90 with N::C. This test 
is writing key-value pairs, with the keys being an incrementing number and 
the values being a log line from one of our systems (a ~200 character 
string). I'm surprised there is such a huge difference in speed between 
the two modules and that the transactions per second are so low even on 
my 3.2GHz P4 2GB RAM box. I tried dropping the consistency level down to 
zero but it had a negligible effect.






Re: writes to Cassandra failing occasionally

Posted by Mike Gallamore <mi...@googlemail.com>.
That makes sense. My data is coming in from the internet and is processed
in chunks, since it arrives via ActiveMQ with the Stomp package. I'm
getting the log lines in 20-1000 line chunks (depending on how busy
customer sites are), so there is definitely the potential for a lot of
parallelism. Some of my data will likely be in cache during the write
because of the nature of the work: it's a reputation system, so I first
get a query from the customer for the reputation and then, within a minute
or so, I get feedback from them about the current event's "score", which
feeds back into the system to update the value. Anyway, lots of
parallelism opportunities.

2010/4/9 Ted Zlatanov <tz...@lifelogs.com>

> On Thu, 08 Apr 2010 11:50:38 -0700 Mike Gallamore <
> mike.e.gallamore@googlemail.com> wrote:
>
> MG> Yes I agree single threaded is probably not the best. I wonder how
> MG> much of a performance hit it is on a single CPU machine though? I
> MG> guess I would still be blocking on RAM writes, but it isn't like
> MG> there are multiple CPUs I need to keep busy or anything.
>
> Cassandra may have to load data from disk for a particular query but
> another may already be in memory.  A third may cause a hit on another
> cluster node.  So if you issue queries serially you'll see performance
> drop off with the total number of queries because they are dependent on
> each other's performance, while the distribution of the performance of
> independent parallel queries will have skew and kurtosis much closer to
> a normal distribution.  In other words, your slowest (or unluckiest)
> queries are less damaging when you issue them in parallel.
>
> On the client side you still have slow serialization/deserialization and
> not much can be done about that.
>
> Ted
>
>

Re: writes to Cassandra failing occasionally

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Thu, 08 Apr 2010 11:50:38 -0700 Mike Gallamore <mi...@googlemail.com> wrote: 

MG> Yes I agree single threaded is probably not the best. I wonder how
MG> much of a performance hit it is on a single CPU machine though? I
MG> guess I would still be blocking on RAM writes, but it isn't like
MG> there are multiple CPUs I need to keep busy or anything.

Cassandra may have to load data from disk for a particular query but
another may already be in memory.  A third may cause a hit on another
cluster node.  So if you issue queries serially you'll see performance
drop off with the total number of queries because they are dependent on
each other's performance, while the distribution of the performance of
independent parallel queries will have skew and kurtosis much closer to
a normal distribution.  In other words, your slowest (or unluckiest)
queries are less damaging when you issue them in parallel.

On the client side you still have slow serialization/deserialization and
not much can be done about that.

Ted


Re: writes to Cassandra failing occasionally

Posted by Mike Gallamore <mi...@googlemail.com>.
I'll work on making a benchmark sometime later. But I don't think that 
my changes would be batched. My rows only have one column, and for this 
test each row is only accessed once (when it is written); I pretty much 
mapped directly over from a key-value store that was using memcache before.

It is something like:

key(5-10 char)       log_line (~130 char) eg.

test-1    [2010-04-08 14:46:19 -0400] [6704] tc=4.3.48 
i=119.93.29.92:4817 h=[UNAVAILABLE] o=N t= u= a= m= r= ir= l=CONN c=550 
z="Connection refused" id=74a26fe4 q= x="RBL 
action=reject;zen.dnsbl=127.0.0.4.reject;throttle.bobsbolts.com=127.0.0.99.throttle;t=0,0" 
n=82/55/1 f=24 p=0 s=""   d=0.00
test-2    [2010-04-08 14:46:19 -0400] [6700] tc=4.3.48 
i=204.16.8.227:29873 h=[UNAVAILABLE] o=N t= u= a= m= r= ir= l=CONN c=550 
z="Connection refused" id=4e2c43a6 q= x="RBL 
action=reject;zen.dnsbl=127.0.0.4.reject;zen.dnsbl=127.0.0.11.reject;throttle.bobsbolts.com=127.0.0.99.throttle;t=0,0" 
n=154/115/1 f=24 p=0 s=""   d=0.03

Yes I agree single threaded is probably not the best. I wonder how much 
of a performance hit it is on a single CPU machine though? I guess I 
would still be blocking on RAM writes, but it isn't like there are 
multiple CPUs I need to keep busy or anything.
On 04/08/2010 06:07 AM, Ted Zlatanov wrote:
> On Wed, 07 Apr 2010 13:19:26 -0700 Mike Gallamore<mi...@googlemail.com>  wrote:
>
> MG>  As an aside I modified some other code to use Net::Cassandra instead
> MG>  of Net::Cassandra::Easy and noticed that it seems to run 3-4X
> MG>  slower. Both aren't stunningly fast. The test clients are running on
> MG>  the same machine as Cassandra, and I'm only getting somewhere between
> MG>  100-400 (huge variance) with N::C::Easy and 30-90 with N::C. This test
> MG>  is writing key value pairs, with the keys being an incrementing
> MG>  number, and the values being a log line from one of our systems (~200
> MG>  character string). I'm surprised there is such a huge difference in
> MG>  speed between the two modules and that the transactions per second are
> MG>  so low even on my 3.2Ghz P4 2GB RAM box. I tried dropping the
> MG>  consistency level down to zero but it had a negligible effect.
>
> First of all, Thrift and the way it's implemented in pure Perl
> (Inline::C or XS would have been much better, plus the data structures
> are horrible) are IMO the most annoying thing about working with
> Cassandra.  I proposed a pluggable API mechanism so users don't have to
> depend on Thrift but the proposal was rejected, so for now Thrift (with
> the crash-on-demand feature) is the only actively developed Cassandra
> API.  Avro is supposed to be happening soon and I look forward to that.
>
> You should benchmark your code; make sure you're comparing apples to
> apples.  N::C::Easy wraps the operations for you, always using multigets
> and mutations on the backend.  I don't know how your Net::Cassandra test
> is implemented.  It may be you're making multiple requests when you only
> need one.  But more importantly, unless you fork multiple processes you
> won't be winning any speed races.  Use Tie::ShareLite, for example, to
> synchronize your data structures through shared memory.
>
> If you can put together benchmarks that run against the default
> (Keyspace1) configuration, I can try to optimize things.  I won't be
> rewriting the Thrift side, so it will still be slow on
> serialize/deserialize operations, but everything else will be fixed if
> it's suboptimal.
>
> Ted
>
>    


Re: writes to Cassandra failing occasionally

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Wed, 07 Apr 2010 13:19:26 -0700 Mike Gallamore <mi...@googlemail.com> wrote: 

MG> As an aside I modified some other code to use Net::Cassandra instead
MG> of Net::Cassandra::Easy and noticed that it seems to run 3-4X
MG> slower. Both aren't stunningly fast. The test clients are running on
MG> the same machine as Cassandra, and I'm only getting somewhere between
MG> 100-400 (huge variance) with N::C::Easy and 30-90 with N::C. This test
MG> is writing key value pairs, with the keys being an incrementing
MG> number, and the values being a log line from one of our systems (~200
MG> character string). I'm surprised there is such a huge difference in
MG> speed between the two modules and that the transactions per second are
MG> so low even on my 3.2Ghz P4 2GB RAM box. I tried dropping the
MG> consistency level down to zero but it had a negligible effect.

First of all, Thrift and the way it's implemented in pure Perl
(Inline::C or XS would have been much better, plus the data structures
are horrible) are IMO the most annoying thing about working with
Cassandra.  I proposed a pluggable API mechanism so users don't have to
depend on Thrift but the proposal was rejected, so for now Thrift (with
the crash-on-demand feature) is the only actively developed Cassandra
API.  Avro is supposed to be happening soon and I look forward to that.

You should benchmark your code; make sure you're comparing apples to
apples.  N::C::Easy wraps the operations for you, always using multigets
and mutations on the backend.  I don't know how your Net::Cassandra test
is implemented.  It may be you're making multiple requests when you only
need one.  But more importantly, unless you fork multiple processes you
won't be winning any speed races.  Use Tie::ShareLite, for example, to
synchronize your data structures through shared memory.

If you can put together benchmarks that run against the default
(Keyspace1) configuration, I can try to optimize things.  I won't be
rewriting the Thrift side, so it will still be slow on
serialize/deserialize operations, but everything else will be fixed if
it's suboptimal.

Ted
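
As a rough illustration of the fork-multiple-processes advice above, a minimal sketch; write_chunk() is a hypothetical placeholder for the actual client calls, and none of the N::C::Easy API is shown:

use strict;
use warnings;

my $workers = 4;
my @pids;
for my $w (0 .. $workers - 1) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: open its own connection and write its share of the keys.
        write_chunk($w, $workers);
        exit 0;
    }
    push @pids, $pid;
}
waitpid($_, 0) for @pids;

sub write_chunk {
    my ($worker, $total) = @_;
    # ... connect to Cassandra here and insert every $total-th key,
    # offset by $worker, so the workers don't overlap ...
}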


Re: writes to Cassandra failing occasionally

Posted by Mike Gallamore <mi...@googlemail.com>.
On 04/07/2010 01:31 PM, Eric Evans wrote:
> On Wed, 2010-04-07 at 13:19 -0700, Mike Gallamore wrote:
>    
>> I have writes to cassandra that are failing, or at least a read
>> shortly after a write is still getting an old value. I realize
>> Cassandra is "eventually consistent" but this system is a single CPU
>> single node with consistency level set to 1, so this seems odd to me.
>>      
> I don't know anything about the Net::Cassandra module, but the update
> needs to have a greater timestamp than the original write, and it sounds
> like that might not be happening here.
>
>    
Yes, this was the problem. There appears to have been a bug in the module; 
I have a patch that I'll post in another reply.

Re: writes to Cassandra failing occasionally

Posted by Eric Evans <ee...@rackspace.com>.
On Wed, 2010-04-07 at 13:19 -0700, Mike Gallamore wrote:
> I have writes to cassandra that are failing, or at least a read
> shortly after a write is still getting an old value. I realize
> Cassandra is "eventually consistent" but this system is a single CPU
> single node with consistency level set to 1, so this seems odd to me.

I don't know anything about the Net::Cassandra module, but the update
needs to have a greater timestamp than the original write, and it sounds
like that might not be happening here.

-- 
Eric Evans
eevans@rackspace.com
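
A toy model of the last-write-wins rule Eric describes; this is only an illustration of the reconciliation behavior, not Cassandra's code, and the timestamps are example values:

use strict;
use warnings;

# For a given column, the write carrying the highest timestamp wins;
# a write whose timestamp is not greater than the stored one is dropped.
my %column = ( value => 72, timestamp => 23_000_025 );

sub apply_write {
    my ($value, $timestamp) = @_;
    return if $timestamp <= $column{timestamp};   # looks older: ignored
    @column{qw(value timestamp)} = ( $value, $timestamp );
}

apply_write(55, 2_325);        # timestamp appears older than the stored one
print "$column{value}\n";      # still 72

apply_write(55, 23_000_100);   # strictly greater timestamp
print "$column{value}\n";      # now 55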


Re: writes to Cassandra failing occasionally

Posted by Mike Gallamore <mi...@googlemail.com>.
On 04/08/2010 05:53 AM, Ted Zlatanov wrote:
> On Thu, 08 Apr 2010 12:53:48 +0100 Philip Jackson<ph...@shellarchive.co.uk>  wrote:
>
> PJ>  At Wed, 07 Apr 2010 13:19:26 -0700,
> PJ>  Mike Gallamore wrote:
>    
>>> I have writes to cassandra that are failing, or at least a read shortly
>>> after a write is still getting an old value. I realize Cassandra is
>>> "eventually consistent" but this system is a single CPU single node with
>>> consistency level set to 1, so this seems odd to me.
>>>        
> PJ>  I'm having this problem too (see my post the other day). I use N::C
> PJ>  but generate timestamps in the same way as N::C::E, I've tested that
> PJ>  each is smaller than the next so I'm wondering if I'm barking up the
> PJ>  wrong tree.
>
> PJ>  If you figure out what's going on please do post back here, I'll do
> PJ>  the same.
>
> Please put together a test that runs against the default keyspace
> (Keyspace1) or give me your configuration plus a test.  At the very
> least, set $Net::Cassandra::Easy::DEBUG to 1 and look at the timestamps
> it's generating in the Thrift requests.
>
> By default N::C::Easy uses Moose to provide timestamps through a
> double-wrapped (so it's called every time) sub:
>
> has timestamp =>  ( is =>  'ro', isa =>  'CodeRef', default =>  sub { sub { join('', gettimeofday()) } } );
>
> This has worked for me but it certainly could be the problem.
>
> Ted
>
>    
Hello Ted. Thanks for your help. I had already looked at that line and 
modified it slightly to output the timestamp so I could inspect it. I 
didn't spot the problem myself; the timestamp seemed to be always 
increasing, since the most significant digits were increasing. One of my 
colleagues, looking over my shoulder, noticed that the number of digits 
in the timestamp is different when it fails. I think this is a bug: a 
straight join won't work because the microseconds part of the Time::HiRes 
return value is of varying length, i.e. it isn't zero padded. So 23 s 
25 us would show up as 2325 after the join, not 23000025, which means it 
would look "older" than, say, 22999993 (22 s 999993 us).
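
A minimal demonstration of that failure mode, using the example values above rather than live clock reads:

use strict;
use warnings;

# Previous write at 22 s 999993 us, next write at 23 s 25 us.
my ($prev_s, $prev_us) = (22, 999993);
my ($next_s, $next_us) = (23, 25);

# Unpadded join: the newer timestamp compares as smaller.
my $prev_bad = $prev_s . $prev_us;                    # "22999993"
my $next_bad = $next_s . $next_us;                    # "2325"
print "unpadded: ", ($next_bad > $prev_bad ? "ok" : "broken"), "\n";  # broken

# Zero-padding the microseconds keeps the ordering intact.
my $prev_ok = $prev_s . sprintf("%06d", $prev_us);    # "22999993"
my $next_ok = $next_s . sprintf("%06d", $next_us);    # "23000025"
print "padded:   ", ($next_ok > $prev_ok ? "ok" : "broken"), "\n";    # ok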

I've written a patch that zero pads the numbers. I've attached it to 
this post, but in case attachments don't come through on this mailing 
list, here is the body:

diff -ru Net-Cassandra-Easy-0.09/lib/Net/Cassandra/Easy.pm Net-Cassandra-Easy-0.09-patched/lib/Net/Cassandra/Easy.pm
--- Net-Cassandra-Easy-0.09/lib/Net/Cassandra/Easy.pm	2010-04-06 13:03:59.000000000 -0700
+++ Net-Cassandra-Easy-0.09-patched/lib/Net/Cassandra/Easy.pm	2010-04-08 11:29:39.000000000 -0700
@@ -40,7 +40,11 @@
 has recv_buffer  => ( is => 'ro', isa => 'Int',     default => 1024 );
 has send_buffer  => ( is => 'ro', isa => 'Int',     default => 1024 );
 has max_results  => ( is => 'ro', isa => 'Int',     default => 100 );
-has timestamp    => ( is => 'ro', isa => 'CodeRef', default => sub { sub { join('', gettimeofday()) } } );
+has timestamp    => ( is => 'ro', isa => 'CodeRef', default => sub {
+        sub {
+            my ($secs, $usecs) = gettimeofday();
+            return $secs . sprintf "%0.6d", $usecs;
+        } } );
 
 # read and write consistency can be changed on the fly
 has read_consistency  => ( is => 'rw', isa => 'Int', default => Net::GenCassandra::ConsistencyLevel::ONE );

I'm not sure if it is a setting I need to change or what, but I don't see 
my own posts to the user group, so if you could reply to this to let me 
know that you saw it, that would be great. Otherwise I'll never know 
whether you got the patch.

Re: writes to Cassandra failing occasionally

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Thu, 08 Apr 2010 12:53:48 +0100 Philip Jackson <ph...@shellarchive.co.uk> wrote: 

PJ> At Wed, 07 Apr 2010 13:19:26 -0700,
PJ> Mike Gallamore wrote:
>> 
>> I have writes to cassandra that are failing, or at least a read shortly 
>> after a write is still getting an old value. I realize Cassandra is 
>> "eventually consistent" but this system is a single CPU single node with 
>> consistency level set to 1, so this seems odd to me.

PJ> I'm having this problem too (see my post the other day). I use N::C
PJ> but generate timestamps in the same way as N::C::E, I've tested that
PJ> each is smaller than the next so I'm wondering if I'm barking up the
PJ> wrong tree.

PJ> If you figure out what's going on please do post back here, I'll do
PJ> the same.

Please put together a test that runs against the default keyspace
(Keyspace1) or give me your configuration plus a test.  At the very
least, set $Net::Cassandra::Easy::DEBUG to 1 and look at the timestamps
it's generating in the Thrift requests.

By default N::C::Easy uses Moose to provide timestamps through a
double-wrapped (so it's called every time) sub:

has timestamp => ( is => 'ro', isa => 'CodeRef', default => sub { sub { join('', gettimeofday()) } } );

This has worked for me but it certainly could be the problem.

Ted
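
A minimal sketch of turning that debug flag on before running the failing test; the commented-out constructor arguments are illustrative guesses, not verified against the module:

use strict;
use warnings;
use Net::Cassandra::Easy;

# Print the generated Thrift requests, including their timestamps,
# so out-of-order values can be spotted directly.
$Net::Cassandra::Easy::DEBUG = 1;

# my $client = Net::Cassandra::Easy->new(
#     server   => 'localhost',     # parameter names are assumptions
#     keyspace => 'Keyspace1',     # the default keyspace mentioned above
# );
# ... issue the mutations under test here and watch the debug output ...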


Re: writes to Cassandra failing occasionally

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Thu, 8 Apr 2010 10:56:55 -0500 Jonathan Ellis <jb...@gmail.com> wrote: 

JE> is N:C:E possibly ignoring thrift exceptions?

I always pass them down to the user.  The user is responsible for
wrapping with eval().

Ted


Re: writes to Cassandra failing occasionally

Posted by Jonathan Ellis <jb...@gmail.com>.
is N:C:E possibly ignoring thrift exceptions?

On Thu, Apr 8, 2010 at 10:45 AM, Mike Gallamore
<mi...@googlemail.com> wrote:
> On 04/08/2010 04:53 AM, Philip Jackson wrote:
>>
>> At Wed, 07 Apr 2010 13:19:26 -0700,
>> Mike Gallamore wrote:
>>
>>>
>>> I have writes to cassandra that are failing, or at least a read shortly
>>> after a write is still getting an old value. I realize Cassandra is
>>> "eventually consistent" but this system is a single CPU single node with
>>> consistency level set to 1, so this seems odd to me.
>>>
>>
>> I'm having this problem too (see my post the other day). I use N::C
>> but generate timestamps in the same way as N::C::E, I've tested that
>> each is smaller than the next so I'm wondering if I'm barking up the
>> wrong tree.
>>
>> If you figure out what's going on please do post back here, I'll do
>> the same.
>>
>> Cheers,
>> Phil
>>
>
> I modified N::C::E to output the timestamp each time the timestamp
> subroutine is called. I've confirmed that the timestamp is increasing (never
> getting two updates with the same timestamp) but I'm still losing data.
>
> This is painful because part of the data structure I'm working with is a
> revolving one week bucket of daily scores. When the system forgets that the
> day has changed it thinks it is still yesterday and "nukes" the data in that
> bucket (which it just set) thinking that it is week old data. The desired
> behavior of course is that it realize that it is a new day (this is stored
> in the data structure itself and isn't modifiable as it depends on when the
> data was first created for a particular row which "bucket" the system will
> be working with any given day) and nuke the new day's old data before
> updating it with a new score.
>

Re: writes to Cassandra failing occasionally

Posted by Mike Gallamore <mi...@googlemail.com>.
On 04/08/2010 04:53 AM, Philip Jackson wrote:
> At Wed, 07 Apr 2010 13:19:26 -0700,
> Mike Gallamore wrote:
>    
>> I have writes to cassandra that are failing, or at least a read shortly
>> after a write is still getting an old value. I realize Cassandra is
>> "eventually consistent" but this system is a single CPU single node with
>> consistency level set to 1, so this seems odd to me.
>>      
> I'm having this problem too (see my post the other day). I use N::C
> but generate timestamps in the same way as N::C::E, I've tested that
> each is smaller than the next so I'm wondering if I'm barking up the
> wrong tree.
>
> If you figure out what's going on please do post back here, I'll do
> the same.
>
> Cheers,
> Phil
>    
I modified N::C::E to output the timestamp each time the timestamp 
subroutine is called. I've confirmed that the timestamp is increasing 
(never getting two updates with the same timestamp) but I'm still 
losing data.

This is painful because part of the data structure I'm working with is a 
revolving one-week set of daily score buckets. When the system forgets 
that the day has changed, it thinks it is still yesterday and "nukes" the 
data in that bucket (which it just set), thinking it is week-old data. 
The desired behavior, of course, is that it realizes it is a new day and 
nukes the new day's week-old data before updating it with a new score. 
(Which "bucket" the system works with on any given day is stored in the 
data structure itself and isn't modifiable, since it depends on when the 
data was first created for a particular row.)
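
As an aside, a toy model of the revolving one-week bucket described above, assuming seven day-of-week slots per row (the real layout may differ); the day numbers and scores are made up:

use strict;
use warnings;

# Seven daily score slots; a write landing on a day whose slot still
# holds week-old data clears that slot before updating it.
my %row = ( buckets => [ map { { day => undef, score => 0 } } 0 .. 6 ] );

sub add_score {
    my ($row, $epoch_day, $score) = @_;
    my $slot = $row->{buckets}[ $epoch_day % 7 ];
    if ( !defined $slot->{day} or $slot->{day} != $epoch_day ) {
        # New day for this slot: nuke last week's data first.
        %$slot = ( day => $epoch_day, score => 0 );
    }
    $slot->{score} += $score;
}

add_score(\%row, 14706, 10);
add_score(\%row, 14706, 5);
print $row{buckets}[ 14706 % 7 ]{score}, "\n";   # 15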

Re: Some insight into the slow read speed. Where to go from here? RC1 MESSAGE-DESERIALIZER-POOL

Posted by Jonathan Ellis <jb...@gmail.com>.
The sawtooth RAM graph is typical of normal GC activity, btw -- the GC
doesn't bother with a major collection until it reaches some percentage
of the total available.

RE: Some insight into the slow read speed. Where to go from here? RC1 MESSAGE-DESERIALIZER-POOL

Posted by Mark Jones <MJ...@imagehawk.com>.
I restarted Cassandra on that node to clear out that queue, reduced the memory available to Java to 4GB, and now I'm able to read with 8 concurrent threads, at about 110 reads/second.

Running iostat -x, I see a large amount of time in await and a small amount in svctm, indicating the device is responding quickly but the queue length is excessive.

All the data filesystems are XFS on 64-bit Debian, Cassandra is the only thing reading/writing them, and the CPUs are napping.

Linux 2.6.31-14-generic (record)        04/08/2010      _x86_64_        (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          17.77    0.06    5.62   14.32    0.00   62.24

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda              14.56    87.14    5.62    2.02   181.05   712.94   116.98     0.14   18.66   4.64   3.55
sdb               0.03     0.00    0.02    0.00     0.34     0.01    14.22     0.00   10.00   9.99   0.02
sdc               0.18    83.79   91.08    2.13  4159.40  2024.88    66.34     5.35   57.38   3.43  31.94

$ iostat -x
Linux 2.6.31-14-generic (record)        04/08/2010      _x86_64_        (2 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          17.77    0.06    5.62   14.32    0.00   62.24

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sda              14.56    87.13    5.62    2.02   181.04   712.92   116.98     0.14   18.66   4.64   3.55
sdb               0.03     0.00    0.02    0.00     0.34     0.01    14.22     0.00   10.00   9.99   0.02
sdc               0.18    83.78   91.09    2.13  4159.65  2024.82    66.34     5.35   57.38   3.43  31.95

-----Original Message-----
From: Jonathan Ellis [mailto:jbellis@gmail.com]
Sent: Thursday, April 08, 2010 10:12 AM
To: user@cassandra.apache.org
Subject: Re: Some insight into the slow read speed. Where to go from here? RC1 MESSAGE-DESERIALIZER-POOL

Have you checked iostat -x ?

On Thu, Apr 8, 2010 at 9:45 AM, Mark Jones <MJ...@imagehawk.com> wrote:
> I don't see any way to increase the # of active Deserializers in storage-conf.xml
>
> Tpstats more than 8 hours after insert/read stop
>
> Pool Name                    Active   Pending      Completed
> FILEUTILS-DELETE-POOL             0         0            227
> STREAM-STAGE                      0         0              1
> RESPONSE-STAGE                    0         0       76724280
> ROW-READ-STAGE                    8      4091        1138277
> LB-OPERATIONS                     0         0              0
> MESSAGE-DESERIALIZER-POOL         1   1849826       78135012
> GMFD                              0         0         136886
> LB-TARGET                         0         0              0
> CONSISTENCY-MANAGER               0         0           1803
> ROW-MUTATION-STAGE                0         0       68669717
> MESSAGE-STREAMING-POOL            0         0              0
> LOAD-BALANCER-STAGE               0         0              0
> FLUSH-SORTER-POOL                 0         0              0
> MEMTABLE-POST-FLUSHER             0         0            438
> FLUSH-WRITER-POOL                 0         0            438
> AE-SERVICE-STAGE                  0         0              3
> HINTED-HANDOFF-POOL               0         0              3
>
> More than 30 minutes later (with no reads or writes to the cluster)
>
> Pool Name                    Active   Pending      Completed
> FILEUTILS-DELETE-POOL             0         0            227
> STREAM-STAGE                      0         0              1
> RESPONSE-STAGE                    0         0       76724280
> ROW-READ-STAGE                    8      4098        1314304
> LB-OPERATIONS                     0         0              0
> MESSAGE-DESERIALIZER-POOL         1   1663578       78336771
> GMFD                              0         0         142651
> LB-TARGET                         0         0              0
> CONSISTENCY-MANAGER               0         0           1803
> ROW-MUTATION-STAGE                0         0       68669717
> MESSAGE-STREAMING-POOL            0         0              0
> LOAD-BALANCER-STAGE               0         0              0
> FLUSH-SORTER-POOL                 0         0              0
> MEMTABLE-POST-FLUSHER             0         0            438
> FLUSH-WRITER-POOL                 0         0            438
> AE-SERVICE-STAGE                  0         0              3
> HINTED-HANDOFF-POOL               0         0              3
>
> The other 2 nodes in the cluster have Pending Counts of 0, but this node seems hung
> indefinitely processing requests that should have long ago timed out for the client.
>
> TOP is showing a huge amount of I/O Wait, but I'm not sure how to track where the wait is happening below here.  I now have jconsole up and running on this machine, and the memory usage appears to be a saw tooth wave, going from 1GB up to 4GB over 3 hours, then plunging back to 1GB and resuming its climb.
>
> top - 08:33:40 up 1 day, 19:25,  4 users,  load average: 7.75, 7.96, 8.16
> Tasks: 177 total,   2 running, 175 sleeping,   0 stopped,   0 zombie
> Cpu(s): 16.6%us,  7.2%sy,  0.0%ni, 34.5%id, 41.1%wa,  0.0%hi,  0.6%si,  0.0%st
> Mem:   8123068k total,  8062240k used,    60828k free,     2624k buffers
> Swap: 12699340k total,  1951504k used, 10747836k free,  3757300k cached
>

Re: Some insight into the slow read speed. Where to go from here? RC1 MESSAGE-DESERIALIZER-POOL

Posted by Jonathan Ellis <jb...@gmail.com>.
Have you checked iostat -x ?

On Thu, Apr 8, 2010 at 9:45 AM, Mark Jones <MJ...@imagehawk.com> wrote:
> I don't see any way to increase the # of active Deserializers in storage-conf.xml
>
> Tpstats more than 8 hours after insert/read stop
>
> Pool Name                    Active   Pending      Completed
> FILEUTILS-DELETE-POOL             0         0            227
> STREAM-STAGE                      0         0              1
> RESPONSE-STAGE                    0         0       76724280
> ROW-READ-STAGE                    8      4091        1138277
> LB-OPERATIONS                     0         0              0
> MESSAGE-DESERIALIZER-POOL         1   1849826       78135012
> GMFD                              0         0         136886
> LB-TARGET                         0         0              0
> CONSISTENCY-MANAGER               0         0           1803
> ROW-MUTATION-STAGE                0         0       68669717
> MESSAGE-STREAMING-POOL            0         0              0
> LOAD-BALANCER-STAGE               0         0              0
> FLUSH-SORTER-POOL                 0         0              0
> MEMTABLE-POST-FLUSHER             0         0            438
> FLUSH-WRITER-POOL                 0         0            438
> AE-SERVICE-STAGE                  0         0              3
> HINTED-HANDOFF-POOL               0         0              3
>
> More than 30 minutes later (with no reads or writes to the cluster)
>
> Pool Name                    Active   Pending      Completed
> FILEUTILS-DELETE-POOL             0         0            227
> STREAM-STAGE                      0         0              1
> RESPONSE-STAGE                    0         0       76724280
> ROW-READ-STAGE                    8      4098        1314304
> LB-OPERATIONS                     0         0              0
> MESSAGE-DESERIALIZER-POOL         1   1663578       78336771
> GMFD                              0         0         142651
> LB-TARGET                         0         0              0
> CONSISTENCY-MANAGER               0         0           1803
> ROW-MUTATION-STAGE                0         0       68669717
> MESSAGE-STREAMING-POOL            0         0              0
> LOAD-BALANCER-STAGE               0         0              0
> FLUSH-SORTER-POOL                 0         0              0
> MEMTABLE-POST-FLUSHER             0         0            438
> FLUSH-WRITER-POOL                 0         0            438
> AE-SERVICE-STAGE                  0         0              3
> HINTED-HANDOFF-POOL               0         0              3
>
> The other 2 nodes in the cluster have Pending Counts of 0, but this node seems hung
> indefinitely processing requests that should have long ago timed out for the client.
>
> TOP is showing a huge amount of I/O Wait, but I'm not sure how to track where the wait is happening below here.  I now have jconsole up and running on this machine, and the memory usage appears to be a saw tooth wave, going from 1GB up to 4GB over 3 hours, then plunging back to 1GB and resuming its climb.
>
> top - 08:33:40 up 1 day, 19:25,  4 users,  load average: 7.75, 7.96, 8.16
> Tasks: 177 total,   2 running, 175 sleeping,   0 stopped,   0 zombie
> Cpu(s): 16.6%us,  7.2%sy,  0.0%ni, 34.5%id, 41.1%wa,  0.0%hi,  0.6%si,  0.0%st
> Mem:   8123068k total,  8062240k used,    60828k free,     2624k buffers
> Swap: 12699340k total,  1951504k used, 10747836k free,  3757300k cached
>

Re: SAR results don't seem overwhelming

Posted by Jonathan Ellis <jb...@gmail.com>.
Can you keep this to one thread please?  It is hard to follow when the
subject keeps changing.

SAR results don't seem overwhelming

Posted by Mark Jones <MJ...@imagehawk.com>.
I stopped writing to the cluster more than 8 hours ago; at worst, I could only be getting a periodic memtable dump (I think).

Running 16 QUORUM read threads getting 600 records/second

Sar for all 3 nodes (collected almost simultaneously):
Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all     10.86      0.00      2.61     44.47      0.00     42.06

Average:          tps      rtps      wtps   bread/s   bwrtn/s
Average:       284.76    283.96      0.80  14541.83      7.17
----------------
Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all     14.33      0.00      2.99     31.45      0.00     51.23

Average:          tps      rtps      wtps   bread/s   bwrtn/s
Average:       219.26    217.96      1.30   4320.16     90.22
----------------
Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all     51.76      0.00      7.50     28.38      0.00     12.35

Average:          tps      rtps      wtps   bread/s   bwrtn/s
Average:       164.72    163.73      0.99  15892.17      8.72

And the client:------------------------------
Average:        CPU     %user     %nice   %system   %iowait    %steal     %idle
Average:        all      0.35      0.00      0.89      0.00      0.00     98.77

Average:          tps      rtps      wtps   bread/s   bwrtn/s
Average:         0.90      0.10      0.80     25.60     27.20


From: Avinash Lakshman [mailto:avinash.lakshman@gmail.com]
Sent: Thursday, April 08, 2010 10:15 AM
To: user@cassandra.apache.org
Subject: Re: Some insight into the slow read speed. Where to go from here? RC1 MESSAGE-DESERIALIZER-POOL

The sawtooth wave in memory utilization could be memtable dumps. I/O wait in TCP happens when you are overwhelming the server with requests. Could you run sar and find out how many bytes/sec you are receiving/transmitting?

Cheers
Avinash
On Thu, Apr 8, 2010 at 7:45 AM, Mark Jones <MJ...@imagehawk.com> wrote:
I don't see any way to increase the # of active Deserializers in storage-conf.xml

Tpstats more than 8 hours after insert/read stop

Pool Name                    Active   Pending      Completed
FILEUTILS-DELETE-POOL             0         0            227
STREAM-STAGE                      0         0              1
RESPONSE-STAGE                    0         0       76724280
ROW-READ-STAGE                    8      4091        1138277
LB-OPERATIONS                     0         0              0
MESSAGE-DESERIALIZER-POOL         1   1849826       78135012
GMFD                              0         0         136886
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0           1803
ROW-MUTATION-STAGE                0         0       68669717
MESSAGE-STREAMING-POOL            0         0              0
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0              0
MEMTABLE-POST-FLUSHER             0         0            438
FLUSH-WRITER-POOL                 0         0            438
AE-SERVICE-STAGE                  0         0              3
HINTED-HANDOFF-POOL               0         0              3

More than 30 minutes later (with no reads or writes to the cluster)

Pool Name                    Active   Pending      Completed
FILEUTILS-DELETE-POOL             0         0            227
STREAM-STAGE                      0         0              1
RESPONSE-STAGE                    0         0       76724280
ROW-READ-STAGE                    8      4098        1314304
LB-OPERATIONS                     0         0              0
MESSAGE-DESERIALIZER-POOL         1   1663578       78336771
GMFD                              0         0         142651
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0           1803
ROW-MUTATION-STAGE                0         0       68669717
MESSAGE-STREAMING-POOL            0         0              0
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0              0
MEMTABLE-POST-FLUSHER             0         0            438
FLUSH-WRITER-POOL                 0         0            438
AE-SERVICE-STAGE                  0         0              3
HINTED-HANDOFF-POOL               0         0              3

The other 2 nodes in the cluster have Pending Counts of 0, but this node seems hung
indefinitely processing requests that should have long ago timed out for the client.

TOP is showing a huge amount of I/O Wait, but I'm not sure how to track where the wait is happening below here.  I now have jconsole up and running on this machine, and the memory usage appears to be a saw tooth wave, going from 1GB up to 4GB over 3 hours, then plunging back to 1GB and resuming its climb.

top - 08:33:40 up 1 day, 19:25,  4 users,  load average: 7.75, 7.96, 8.16
Tasks: 177 total,   2 running, 175 sleeping,   0 stopped,   0 zombie
Cpu(s): 16.6%us,  7.2%sy,  0.0%ni, 34.5%id, 41.1%wa,  0.0%hi,  0.6%si,  0.0%st
Mem:   8123068k total,  8062240k used,    60828k free,     2624k buffers
Swap: 12699340k total,  1951504k used, 10747836k free,  3757300k cached


Re: Some insight into the slow read speed. Where to go from here? RC1 MESSAGE-DESERIALIZER-POOL

Posted by Avinash Lakshman <av...@gmail.com>.
The sawtooth wave in memory utilization could be memtable dumps. I/O wait in
TCP happens when you are overwhelming the server with requests. Could you
run sar and find out how many bytes/sec you are receiving/transmitting?

Cheers
Avinash

On Thu, Apr 8, 2010 at 7:45 AM, Mark Jones <MJ...@imagehawk.com> wrote:

> I don't see any way to increase the # of active Deserializers in
> storage-conf.xml
>
> Tpstats more than 8 hours after insert/read stop
>
> Pool Name                    Active   Pending      Completed
> FILEUTILS-DELETE-POOL             0         0            227
> STREAM-STAGE                      0         0              1
> RESPONSE-STAGE                    0         0       76724280
> ROW-READ-STAGE                    8      4091        1138277
> LB-OPERATIONS                     0         0              0
> MESSAGE-DESERIALIZER-POOL         1   1849826       78135012
> GMFD                              0         0         136886
> LB-TARGET                         0         0              0
> CONSISTENCY-MANAGER               0         0           1803
> ROW-MUTATION-STAGE                0         0       68669717
> MESSAGE-STREAMING-POOL            0         0              0
> LOAD-BALANCER-STAGE               0         0              0
> FLUSH-SORTER-POOL                 0         0              0
> MEMTABLE-POST-FLUSHER             0         0            438
> FLUSH-WRITER-POOL                 0         0            438
> AE-SERVICE-STAGE                  0         0              3
> HINTED-HANDOFF-POOL               0         0              3
>
> More than 30 minutes later (with no reads or writes to the cluster)
>
> Pool Name                    Active   Pending      Completed
> FILEUTILS-DELETE-POOL             0         0            227
> STREAM-STAGE                      0         0              1
> RESPONSE-STAGE                    0         0       76724280
> ROW-READ-STAGE                    8      4098        1314304
> LB-OPERATIONS                     0         0              0
> MESSAGE-DESERIALIZER-POOL         1   1663578       78336771
> GMFD                              0         0         142651
> LB-TARGET                         0         0              0
> CONSISTENCY-MANAGER               0         0           1803
> ROW-MUTATION-STAGE                0         0       68669717
> MESSAGE-STREAMING-POOL            0         0              0
> LOAD-BALANCER-STAGE               0         0              0
> FLUSH-SORTER-POOL                 0         0              0
> MEMTABLE-POST-FLUSHER             0         0            438
> FLUSH-WRITER-POOL                 0         0            438
> AE-SERVICE-STAGE                  0         0              3
> HINTED-HANDOFF-POOL               0         0              3
>
> The other 2 nodes in the cluster have Pending Counts of 0, but this node
> seems hung
> indefinitely processing requests that should have long ago timed out for
> the client.
>
> TOP is showing a huge amount of I/O Wait, but I'm not sure how to track
> where the wait is happening below here.  I now have jconsole up and running
> on this machine, and the memory usage appears to be a saw tooth wave, going
> from 1GB up to 4GB over 3 hours, then plunging back to 1GB and resuming its
> climb.
>
> top - 08:33:40 up 1 day, 19:25,  4 users,  load average: 7.75, 7.96, 8.16
> Tasks: 177 total,   2 running, 175 sleeping,   0 stopped,   0 zombie
> Cpu(s): 16.6%us,  7.2%sy,  0.0%ni, 34.5%id, 41.1%wa,  0.0%hi,  0.6%si,
>  0.0%st
> Mem:   8123068k total,  8062240k used,    60828k free,     2624k buffers
> Swap: 12699340k total,  1951504k used, 10747836k free,  3757300k cached
>

Some insight into the slow read speed. Where to go from here? RC1 MESSAGE-DESERIALIZER-POOL

Posted by Mark Jones <MJ...@imagehawk.com>.
I don't see any way to increase the # of active Deserializers in storage-conf.xml

Tpstats more than 8 hours after insert/read stop

Pool Name                    Active   Pending      Completed
FILEUTILS-DELETE-POOL             0         0            227
STREAM-STAGE                      0         0              1
RESPONSE-STAGE                    0         0       76724280
ROW-READ-STAGE                    8      4091        1138277
LB-OPERATIONS                     0         0              0
MESSAGE-DESERIALIZER-POOL         1   1849826       78135012
GMFD                              0         0         136886
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0           1803
ROW-MUTATION-STAGE                0         0       68669717
MESSAGE-STREAMING-POOL            0         0              0
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0              0
MEMTABLE-POST-FLUSHER             0         0            438
FLUSH-WRITER-POOL                 0         0            438
AE-SERVICE-STAGE                  0         0              3
HINTED-HANDOFF-POOL               0         0              3

More than 30 minutes later (with no reads or writes to the cluster)

Pool Name                    Active   Pending      Completed
FILEUTILS-DELETE-POOL             0         0            227
STREAM-STAGE                      0         0              1
RESPONSE-STAGE                    0         0       76724280
ROW-READ-STAGE                    8      4098        1314304
LB-OPERATIONS                     0         0              0
MESSAGE-DESERIALIZER-POOL         1   1663578       78336771
GMFD                              0         0         142651
LB-TARGET                         0         0              0
CONSISTENCY-MANAGER               0         0           1803
ROW-MUTATION-STAGE                0         0       68669717
MESSAGE-STREAMING-POOL            0         0              0
LOAD-BALANCER-STAGE               0         0              0
FLUSH-SORTER-POOL                 0         0              0
MEMTABLE-POST-FLUSHER             0         0            438
FLUSH-WRITER-POOL                 0         0            438
AE-SERVICE-STAGE                  0         0              3
HINTED-HANDOFF-POOL               0         0              3

The other 2 nodes in the cluster have Pending Counts of 0, but this node seems hung
indefinitely processing requests that should have long ago timed out for the client.

TOP is showing a huge amount of I/O Wait, but I'm not sure how to track where the wait is happening below here.  I now have jconsole up and running on this machine, and the memory usage appears to be a saw tooth wave, going from 1GB up to 4GB over 3 hours, then plunging back to 1GB and resuming its climb.

top - 08:33:40 up 1 day, 19:25,  4 users,  load average: 7.75, 7.96, 8.16
Tasks: 177 total,   2 running, 175 sleeping,   0 stopped,   0 zombie
Cpu(s): 16.6%us,  7.2%sy,  0.0%ni, 34.5%id, 41.1%wa,  0.0%hi,  0.6%si,  0.0%st
Mem:   8123068k total,  8062240k used,    60828k free,     2624k buffers
Swap: 12699340k total,  1951504k used, 10747836k free,  3757300k cached

Re: writes to Cassandra failing occasionally

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Thu, 08 Apr 2010 12:16:34 -0700 Mike Gallamore <mi...@googlemail.com> wrote: 

MG> Hopefully my fix helps others. I imagine it is something you'll run
MG> into regardless of the language/interface you use, for example I'm
MG> pretty sure that the C/C++ time function truncates values too. I'd
MG> recommend anyone using time to generate your timestamp: be careful
MG> that your timestamp is always the same length (or at least that the
MG> sub components that you are concatenating are the length you expect
MG> them to be).

This was a Perl-related bug so I doubt others will see it.  It's really
caused by the fact that 32-bit Perl doesn't have a native 64-bit
pack/unpack function so I'm using the Bit::Vector wrappers and
consequently passing around Longs as strings.
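
For reference, a minimal sketch of the distinction Ted mentions, assuming a perl where 64-bit integer support can be checked via %Config; the fallback branch only names the Bit::Vector approach rather than reproducing it:

use strict;
use warnings;
use Config;

# A Thrift i64 timestamp, e.g. microseconds since the epoch.
my $timestamp = 1270761566000025;

if ( $Config{use64bitint} ) {
    # 64-bit perl: pack natively as an 8-byte big-endian integer.
    my $packed = pack( 'Q>', $timestamp );
    printf "packed %s into %d bytes\n", $timestamp, length $packed;
}
else {
    # 32-bit perl: no native 64-bit pack/unpack, hence the Bit::Vector
    # wrappers and string-typed longs described above.
    warn "32-bit perl: handle the long as a string instead\n";
}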

MG> I've written a patch that zero pads the numbers. I've attached it to
MG> this post but encase attachments don't come through on this
MG> mailinglist here is the body:

Thanks so much for catching this.  I didn't notice it at all (it works
90% of the time!).  I uploaded N::C::Easy 0.10 to CPAN with the fix you
proposed, so now timestamps are produced correctly.

Ted


Re: writes to Cassandra failing occasionally

Posted by Mike Gallamore <mi...@googlemail.com>.
On 04/08/2010 11:46 AM, Jeremy Dunck wrote:
> On Thu, Apr 8, 2010 at 12:41 PM, Mike Gallamore
> <mi...@googlemail.com>  wrote:
>    
>> Hello. If you are doing exactly the same thing as N::C::Easy (ie a join on
>> the gettimeofday). Then you should have the same problem I found a fix for.
>> The problem is that the microseconds value isn't zero padded. So if you are
>> at say 23s 25 micro seconds you get a date: 2325. not 23000025 which of
>> course means that sometimes your next timestamp will be "older" than your
>> previous one. Here is a patch to N::C::Easy that fixes this and hopefully a
>> similar thing will fix your code.
>>      
> All maintainers of competing libraries should be so helpful.  :-)
>    
Thanks for the compliment, but I'm not a maintainer of N::C::Easy, just a 
user. Ted Zlatanov is the maintainer of the package and is quite active 
on the forums. I'd already started looking at that area of the code 
before he mentioned it, but if I hadn't thought of it, his suggestions 
definitely would have pointed me in the right direction. Also, I've 
literally seen updates to the package the same day that a new version of 
Cassandra comes out. Good job, Ted.

Hopefully my fix helps others. I imagine it is something you could run 
into regardless of the language/interface you use; for example, I'm 
pretty sure the C/C++ time functions truncate values too. I'd recommend 
to anyone using the time to generate a timestamp: be careful that your 
timestamp is always the same length (or at least that the sub-components 
you are concatenating are the length you expect them to be).

Cheers

Re: writes to Cassandra failing occasionally

Posted by Jeremy Dunck <jd...@gmail.com>.
On Thu, Apr 8, 2010 at 12:41 PM, Mike Gallamore
<mi...@googlemail.com> wrote:
> Hello. If you are doing exactly the same thing as N::C::Easy (ie a join on
> the gettimeofday). Then you should have the same problem I found a fix for.
> The problem is that the microseconds value isn't zero padded. So if you are
> at say 23s 25 micro seconds you get a date: 2325. not 23000025 which of
> course means that sometimes your next timestamp will be "older" than your
> previous one. Here is a patch to N::C::Easy that fixes this and hopefully a
> similar thing will fix your code.

All maintainers of competing libraries should be so helpful.  :-)

Re: writes to Cassandra failing occasionally

Posted by Philip Jackson <ph...@shellarchive.co.uk>.
At Thu, 08 Apr 2010 11:41:30 -0700,
Mike Gallamore wrote:
> 
> On 04/08/2010 04:53 AM, Philip Jackson wrote:
> > At Wed, 07 Apr 2010 13:19:26 -0700,
> > Mike Gallamore wrote:
> >    
> >> I have writes to cassandra that are failing, or at least a read shortly
> >> after a write is still getting an old value. I realize Cassandra is
> >> "eventually consistent" but this system is a single CPU single node with
> >> consistency level set to 1, so this seems odd to me.
> >>      
> > I'm having this problem too (see my post the other day). I use N::C
> > but generate timestamps in the same way as N::C::E, I've tested that
> > each is smaller than the next so I'm wondering if I'm barking up the
> > wrong tree.
> >
> > If you figure out what's going on please do post back here, I'll do
> > the same.
> >
> > Cheers,
> > Phil
> >    
> Hello. If you are doing exactly the same thing as N::C::Easy (ie a join 
> on the gettimeofday). Then you should have the same problem I found a 
> fix for. The problem is that the microseconds value isn't zero padded. 

Fantastic, you've made my night. Thanks to you and Ted.

Cheers,
Phil

Re: writes to Cassandra failing occasionally

Posted by Mike Gallamore <mi...@googlemail.com>.
On 04/08/2010 04:53 AM, Philip Jackson wrote:
> At Wed, 07 Apr 2010 13:19:26 -0700,
> Mike Gallamore wrote:
>    
>> I have writes to cassandra that are failing, or at least a read shortly
>> after a write is still getting an old value. I realize Cassandra is
>> "eventually consistent" but this system is a single CPU single node with
>> consistency level set to 1, so this seems odd to me.
>>      
> I'm having this problem too (see my post the other day). I use N::C
> but generate timestamps in the same way as N::C::E, I've tested that
> each is smaller than the next so I'm wondering if I'm barking up the
> wrong tree.
>
> If you figure out what's going on please do post back here, I'll do
> the same.
>
> Cheers,
> Phil
>    
Hello. If you are doing exactly the same thing as N::C::Easy (i.e. a join 
on the gettimeofday() return values), then you should have the same 
problem I found a fix for. The problem is that the microseconds value 
isn't zero padded. So if you are at, say, 23 s 25 us you get a timestamp 
of 2325, not 23000025, which of course means that sometimes your next 
timestamp will be "older" than your previous one. Here is a patch to 
N::C::Easy that fixes this; hopefully a similar change will fix your code.



Re: writes to Cassandra failing occasionally

Posted by Philip Jackson <ph...@shellarchive.co.uk>.
At Wed, 07 Apr 2010 13:19:26 -0700,
Mike Gallamore wrote:
> 
> I have writes to cassandra that are failing, or at least a read shortly 
> after a write is still getting an old value. I realize Cassandra is 
> "eventually consistent" but this system is a single CPU single node with 
> consistency level set to 1, so this seems odd to me.

I'm having this problem too (see my post the other day). I use N::C
but generate timestamps in the same way as N::C::E, I've tested that
each is smaller than the next so I'm wondering if I'm barking up the
wrong tree.

If you figure out what's going on please do post back here, I'll do
the same.

Cheers,
Phil