Posted to user@cassandra.apache.org by Philippe Dupont <pd...@teads.tv> on 2013/12/05 16:42:20 UTC

Re: Raid Issue on EC2 Datastax ami, 1.2.11

Hi again,

I have much more information on this case:

We did further investigation on the affected nodes and found await
problems on one of the 4 disks in the RAID:
http://imageshack.com/a/img824/2391/s7q3.jpg

Here is the iostat output from the node:
http://imageshack.us/a/img7/7282/qq3w.png

You can see that read and write throughput are exactly the same across the
4 disks of the instance, so the RAID 0 striping looks fine. Yet await,
r_await and w_await are 3 to 5 times higher on the xvde disk than on the
other disks.
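
For anyone who wants to watch for this kind of imbalance automatically,
here is a minimal sketch that compares per-device await from iostat's
extended output. The device names, the 3x threshold, and the column layout
are assumptions (sysstat versions differ), so adjust them to your setup:

#!/usr/bin/env python
# Illustrative sketch: flag a RAID member whose await is far above its peers.
# Usage (assumption): iostat -dx 5 2 | python check_await.py
import sys

DEVICES = ("xvdb", "xvdc", "xvdd", "xvde")  # assumed ephemeral RAID 0 members
THRESHOLD = 3.0                             # flag devices ~3x slower than the best

header = None
awaits = {}
for line in sys.stdin:
    fields = line.split()
    if not fields:
        continue
    if fields[0] in ("Device:", "Device"):
        header = fields
    elif header is not None and fields[0] in DEVICES:
        # keep the most recent sample for each device
        awaits[fields[0]] = float(fields[header.index("await")])

if awaits:
    best = min(awaits.values()) or 0.01     # avoid division by zero
    for dev, ms in sorted(awaits.items()):
        flag = "  <-- suspicious" if ms / best >= THRESHOLD else ""
        print("%s  await=%.1fms%s" % (dev, ms, flag))

Since the last sample wins, piping in two iostat reports means the script
uses the interval sample rather than the since-boot averages.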

We reported this to Amazon support, and here is their answer:
" Hello,

I deeply apologize for any inconvenience this has been causing you, and
thank you for the additional information and screenshots.

Using the instance you based your "iostat" on ("i-xxxxxxxx"), I have looked
into the underlying hardware it is currently using and I can see it appears
to have a noisy neighbor leading to the higher "await" time on that
particular device. Since most AWS services are multi-tenant, situations can
arise where one customer's resource has the potential to impact the
performance of a different customer's resource that resides on the same
underlying hardware (a "noisy neighbor"). While these occurrences are rare,
they are nonetheless inconvenient and I am very sorry for any impact it has
created.

I have also looked into the initial instance referred to when the case was
created ("i-xxxxxxx") and cannot see any existing issues (neighboring or
otherwise) regarding I/O performance impacts; however, at the time the case
was created, evidence on our end suggests there was a noisy neighbor then
as well. Can you verify whether you are still experiencing above-average
"await" times on this instance?

If you would like to mitigate the impact of encountering "noisy neighbors",
you can look into our Dedicated Instance option; Dedicated Instances launch
on hardware dedicated to only a single customer (though this can feasibly
lead to a situation where a customer is their own noisy neighbor). However,
this is an option available only to instances that are being launched into
a VPC and may require modification of the architecture of your use case. I
understand the instances belonging to your cluster in question have been
launched into EC2-Classic; I just wanted to bring this to your attention as
a possible solution. You can read more about Dedicated Instances here:
http://aws.amazon.com/dedicated-instances/

Again, I am very sorry for the performance impact you have been
experiencing due to having noisy neighbors. We understand the frustration
and are always actively working to increase capacity so the effects of
noisy neighbors are lessened. I hope this information has been useful and,
if you have any additional questions whatsoever, please do not hesitate to
ask! "

To conclude, short of moving to a VPC with Dedicated Instances, the only
remaining solution is to replace this instance with a new one and hope not
to land next to another "noisy neighbor"...
I hope this will help someone.

Philippe


2013/11/28 Philippe DUPONT <pd...@teads.tv>

> Hi,
>
> We have a Cassandra cluster of 28 nodes. Each one is an EC2 m1.xlarge
> based on the DataStax AMI with 4 ephemeral disks in RAID 0.
>
> Here is the ticket we opened with Amazon support:
>
> "This raid is created using the datastax public AMI : ami-b2212dc6.
> Sources are also available here : https://github.com/riptano/ComboAMI
>
> As you can see in the attached screenshot (
> http://imageshack.com/a/img854/4592/xbqc.jpg), randomly but frequently
> one of the disks gets fully utilized (100%) while the 3 others stay at low
> utilization.
>
> Because of this, the node becomes slow and the whole Cassandra cluster is
> impacted. We are losing data due to failed writes, and availability
> suffers for our customers.
>
> It was in this state for one hour, and we decided to restart it.
>
> We already removed 3 other instances because of this same issue."
> (see other screenshots)
> http://imageshack.com/a/img824/2391/s7q3.jpg
> http://imageshack.com/a/img10/556/zzk8.jpg
>
> Amazon support took a close look at the instance as well as its
> underlying hardware for any potential health issues, and both seem to be
> healthy.
>
> Has anyone already experienced something like this?
>
> Or should I rather contact the AMI author?
>
> Thanks a lot,
>
> Philippe.
>
>
>
>

Re: Raid Issue on EC2 Datastax ami, 1.2.11

Posted by Philippe Dupont <pd...@teads.tv>.
Hi Aaron,

As you can see in the picture, there is not much steal reported by iostat.
It's the same with top.
https://imageshack.com/i/0jm4jyp
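
For completeness, here is a minimal sketch of how steal can be read
directly from /proc/stat rather than eyeballed in top; the 5-second
sampling window is an arbitrary assumption:

#!/usr/bin/env python
# Illustrative sketch: measure CPU steal% from /proc/stat as a cross-check
# of what top and iostat report.
import time

def cpu_counters():
    with open("/proc/stat") as f:
        fields = f.readline().split()
    # first line: cpu user nice system idle iowait irq softirq steal ...
    return [int(v) for v in fields[1:]]

before = cpu_counters()
time.sleep(5)
after = cpu_counters()

deltas = [a - b for a, b in zip(after, before)]
total = sum(deltas) or 1
steal = deltas[7] if len(deltas) > 7 else 0  # 8th counter is steal time
print("steal: %.1f%% of CPU time over the last 5s" % (100.0 * steal / total))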


Philippe


2013/12/10 Aaron Morton <aa...@thelastpickle.com>

> Thanks for the update Philippe. Other people have reported high await on a
> single volume previously, but I don’t think it’s been blamed on noisy
> neighbours. It’s interesting that you can have noisy neighbours for IO only.
>
> Out of interest, was there much steal reported in top or iostat?
>
> Cheers
>
> -----------------
> Aaron Morton
> New Zealand
> @aaronmorton
>
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com

Re: Raid Issue on EC2 Datastax ami, 1.2.11

Posted by Aaron Morton <aa...@thelastpickle.com>.
Thanks for the update Philippe. Other people have reported high await on a single volume previously, but I don’t think it’s been blamed on noisy neighbours. It’s interesting that you can have noisy neighbours for IO only.

Out of interest, was there much steal reported in top or iostat?

Cheers

-----------------
Aaron Morton
New Zealand
@aaronmorton

Co-Founder & Principal Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
