Posted to user@cassandra.apache.org by Michael Theroux <mt...@yahoo.com> on 2013/04/24 14:03:17 UTC

Really odd issue (AWS related?)

Hello,

Since Sunday, we've been experiencing a really odd issue in our Cassandra cluster.  We recently started receiving errors that messages are being dropped.  But here is the odd part...

When looking in the AWS console, instead of seeing statistics elevated during this time, we actually see all statistics suddenly drop right before these messages appear. CPU, I/O, and network go way down. In fact, in one case they went to 0 for about 5 minutes, to the point that other Cassandra nodes saw the node in question as being down. The messages appear right after the node "wakes up".

We've had this happen on 3 different nodes on three different days since Sunday.
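
For reference, the dropped-message counters can be checked on the node itself; a quick look, assuming nodetool can reach the local JMX port (the log path below is a packaged-install default and may differ):

nodetool -h localhost tpstats                    # dropped message counts are listed at the end of the output
grep -i dropped /var/log/cassandra/system.log    # the log also records when messages are dropped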

Other facts:

- We upgraded from m1.large to m1.xlarge instances about two weeks ago.
- We are running Cassandra 1.1.9
- We've been doing some memory tuning, although I have seen this happen on untuned nodes.

Has anyone seen anything like this before?

Another related question. Once we see messages being dropped on one node, our Cassandra client appears to see this, reporting errors. We use LOCAL_QUORUM with an RF of 3 on all queries. Any idea why clients would see an error? If only one node reports an error, shouldn't the consistency level prevent the client from seeing an issue?

Thanks for your help,
-Mike

Re: Really odd issue (AWS related?)

Posted by Ben Chobot <be...@instructure.com>.
We've also had issues with ephemeral drives in a single AZ in us-east-1, so much so that we no longer use that AZ. Our issues tended to be obvious from instance boot, though; they wouldn't suddenly degrade.



Re: Really odd issue (AWS related?)

Posted by Alex Major <al...@gmail.com>.
Hi Mike,

We had issues with the ephemeral drives when we first got started, although
we never got to the bottom of it, so I can't help much with troubleshooting,
unfortunately. Contrary to a lot of the comments on the mailing list, we've
actually had a lot more success with EBS drives (PIOPS!). I'd definitely
suggest trying striping 4 EBS drives (RAID 0) and using PIOPS.
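
A minimal sketch of that striping, assuming four already-attached PIOPS
EBS volumes at /dev/xvdf through /dev/xvdi (device names and mount point
are illustrative):

mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi   # RAID 0 across the 4 volumes
mkfs.xfs /dev/md0
mkdir -p /data && mount -t xfs -o noatime /dev/md0 /data

Persisting the array (mdadm --detail --scan >> /etc/mdadm.conf) is usually
what makes it reassemble reliably on reboot.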

You could be having a noisy neighbour problem; I don't believe that
m1.large or m1.xlarge instances get all of the actual hardware, and
virtualisation on EC2 still sucks at isolating resources.

We've also had more success with Ubuntu on EC2; not so much for our
Cassandra nodes, but some of our other services didn't run as well on
Amazon Linux AMIs.

Alex




Re: Really odd issue (AWS related?)

Posted by Michael Theroux <mt...@yahoo.com>.
I forgot to mention,

When things go really bad, I'm seeing I/O waits in the 80-95% range. I restarted Cassandra once while a node was in this situation, and it took 45 minutes to start (primarily reading SSTables). Typically, a node starts in about 5 minutes.
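
When a node is in that state, a per-device view can show whether a single drive in the stripe is the one misbehaving; a quick check, assuming the sysstat package is installed:

iostat -x 5    # watch await and %util for each underlying /dev/sd* device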

Thanks,
-Mike
 


Re: Really odd issue (AWS related?)

Posted by Michael Theroux <mt...@yahoo.com>.
Hello,

We've done some additional monitoring, and I think we have more information. We've been collecting vmstat information every minute, attempting to catch a node with issues.
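
A minimal sketch of that kind of per-minute collection, assuming cron is available (the cron.d file name and log path are illustrative):

# /etc/cron.d/vmstat-sample: append one vmstat report per minute
* * * * * root /usr/bin/vmstat 1 2 >> /var/log/vmstat.log 2>&1
# (the second report line of each run is a fresh 1-second sample; the first is the since-boot average)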

So it appears that the Cassandra node runs fine. Then suddenly, without any correlation to any event I can identify, the I/O wait time goes way up and stays up indefinitely. Even non-Cassandra I/O activities (such as snapshots and backups) start causing large I/O wait times when they typically would not. Before an issue, we would typically see I/O wait times of 3-4% with very few processes blocked on I/O. Once this issue manifests itself, I/O wait times for the same activities jump to 30-40% with many blocked processes. The I/O wait times do go back down when there is literally no activity.

-  Updating the node to the latest Amazon Linux patches and rebooting the instance doesn't correct the issue.
-  Backing up the node, and replacing the instance does correct the issue.  I/O wait times return to normal.

One relatively recent change: we upgraded to m1.xlarge instances, which have 4 ephemeral drives available. We create a logical volume from the 4 drives with the idea that we should be able to get increased I/O throughput. When we ran m1.large instances we had the same setup, although it used only 2 ephemeral drives. We chose LVM over mdadm because we were having issues getting mdadm to create the RAID volume reliably on restart (and research showed this was a common problem). LVM just worked (and had worked for months before this upgrade).

For reference, this is the script we used to create the logical volume:

vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde    # volume group across the 4 ephemeral drives
lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K        # striped LV: 4 stripes, 256 KB stripe size
blockdev --setra 65536 /dev/mnt_vg/mnt_lv              # read-ahead of 65536 sectors (32 MB)
sleep 2
mkfs.xfs /dev/mnt_vg/mnt_lv
sleep 3
mkdir -p /data && mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
sleep 3

Another tidbit... thus far (and this may be only a coincidence), we've only had to replace DB nodes in a single availability zone in us-east. Other availability zones in the same region have yet to show an issue.

It looks like I'm going to need to replace a third DB node today.  Any advice would be appreciated.

Thanks,
-Mike




Re: Really odd issue (AWS related?)

Posted by Michael Theroux <mt...@yahoo.com>.
Thanks.

We weren't monitoring this value when the issue occurred, and this particular issue hasn't appeared for a couple of days (knock on wood). Will keep an eye out, though.

-Mike

On Apr 26, 2013, at 5:32 AM, Jason Wee wrote:

> top command? st : time stolen from this vm by the hypervisor


Re: Really odd issue (AWS related?)

Posted by Jason Wee <pe...@gmail.com>.
The top command? st: time stolen from this VM by the hypervisor.
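
For example (nothing Cassandra-specific assumed), both vmstat and top report it:

vmstat 5 3                 # the last column, "st", is steal time
top -b -n 1 | head -n 5    # the Cpu(s) summary line includes an "st" field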

jason


On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux <mt...@yahoo.com>wrote:

> Sorry, Not sure what CPU steal is :)

Re: Really odd issue (AWS related?)

Posted by Michael Theroux <mt...@yahoo.com>.
Sorry, not sure what CPU steal is :)

I have the AWS console with detailed monitoring enabled... things seem to track close to the minute, so I can see the CPU load go to 0... then jump at about the minute Cassandra reports the dropped messages.

-Mike

On Apr 25, 2013, at 9:50 PM, aaron morton wrote:

>> The messages appear right after the node "wakes up".
> Are you tracking CPU steal ? 


Re: Really odd issue (AWS related?)

Posted by aaron morton <aa...@thelastpickle.com>.
> The messages appear right after the node "wakes up".
Are you tracking CPU steal?

-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com



Re: Really odd issue (AWS related?)

Posted by Robert Coli <rc...@eventbrite.com>.
On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux <mt...@yahoo.com> wrote:
> Another related question.  Once we see messages being dropped on one node, our cassandra client appears to see this, reporting errors.  We use LOCAL_QUORUM with a RF of 3 on all queries.  Any idea why clients would see an error?  If only one node reports an error, shouldn't the consistency level prevent the client from seeing an issue?

If the client is talking to a broken/degraded coordinator node, RF/CL
are unable to protect it from RPCTimeout. If it is unable to
coordinate the request in a timely fashion, your clients will get
errors.

=Rob