Posted to user@cassandra.apache.org by Alain RODRIGUEZ <ar...@gmail.com> on 2013/07/09 16:11:45 UTC

High performance hardware with lot of data per node - Global learning about configuration

Hi,

Using C*1.2.2.

We recently dropped our 18 m1.xLarge (4CPU, 15GB RAM, 4 Raid-0 Disks)
servers to get 3 hi1.4xLarge (16CPU, 60GB RAM, 2 Raid-0 SSD) servers
instead, for about the same price.

We tried it after reading a benchmark published by Netflix.

It is awesome and I recommend it to anyone who is using more than 18 xLarge
servers or who can afford these high-cost / high-performance EC2 instances.
The SSDs give very good throughput with awesome latency.

As a result, we went from about 200 GB of data per server to about 1 TB.

To alleviate memory pressure inside the heap I had to reduce the index
sampling: I changed index_interval from 128 to 512, with no visible impact
on latency but a great improvement on the heap, which no longer complains
about pressure at all.
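
As a rough back-of-the-envelope (the numbers below are made up for
illustration, and the per-sample cost is a guess), the on-heap index samples
shrink linearly with index_interval, so going from 128 to 512 cuts them by
about 4x:

# Back-of-the-envelope estimate of the on-heap index sample size.
# keys_per_node and bytes_per_sample are illustrative guesses, not measurements.
keys_per_node = 5 * 10**8        # hypothetical partition count for ~1 TB per node
bytes_per_sample = 50            # sampled key bytes + per-entry overhead (guess)

for index_interval in (128, 512):
    samples = keys_per_node // index_interval
    mb = samples * bytes_per_sample / (1024.0 * 1024.0)
    print("index_interval=%d -> ~%.0f MB of samples" % (index_interval, mb))
# With these made-up numbers: ~186 MB at 128 vs ~47 MB at 512, a 4x reduction.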

Is there any more tuning I could do, or tricks that could be useful when
running big servers with a lot of data per node and relatively high
throughput?

The SSDs are at 20-40 % of their throughput capacity (according to
OpsCenter), the load average almost never exceeds 5 or 6 (with 16 CPUs), and
about 15 GB of RAM is used out of 60 GB.

At this point I have kept my previous configuration, which is almost the
default one from the DataStax community AMI. Here is part of it; you can
assume that any property not listed here is set to its default:

cassandra.yaml

key_cache_size_in_mb: (empty) - so the default of 100 MB (hit rate between
88 % and 92 %; is that good enough?)
row_cache_size_in_mb: 0 (not usable in our use case, a lot of different and
random reads)
flush_largest_memtables_at: 0.80
reduce_cache_sizes_at: 0.90

concurrent_reads: 32 (I am thinking of increasing this to 64 or more, since
I have just a few servers to handle more concurrency)
concurrent_writes: 32 (I am thinking of increasing this to 64 or more too)
memtable_total_space_in_mb: 1024 (to avoid filling the heap; should I use a
bigger value, and what for?)

rpc_server_type: sync (I tried hsha and got the "ERROR 12:02:18,971 Read an
invalid frame size of 0. Are you using TFramedTransport on the client
side?" error). No idea how to fix this, and I use 5 different clients for
different purposes (Hector, Cassie, phpCassa, Astyanax, Helenus)... A
minimal sketch of a framed Thrift connection follows this list.

multithreaded_compaction: false (Should I try enabling this now that I use
SSDs?)
compaction_throughput_mb_per_sec: 16 (I will definitely raise this to 32 or
even more)

cross_node_timeout: true
endpoint_snitch: Ec2MultiRegionSnitch

index_interval: 512
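
For what it's worth, the framed transport that error message refers to looks
roughly like this with the raw Python Thrift library (a sketch only: the
host, port and the commented-out generated-client import are illustrative,
and every high-level client wraps this differently):

# Sketch of a framed Thrift connection, which is what the hsha server expects.
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol

socket = TSocket.TSocket("127.0.0.1", 9160)        # example host and RPC port
transport = TTransport.TFramedTransport(socket)    # framed, not a raw buffered transport
protocol = TBinaryProtocol.TBinaryProtocol(transport)
transport.open()
# client = Cassandra.Client(protocol)  # generated Thrift client would go here
transport.close()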

cassandra-env.sh

I am not sure how to tune the heap, so I mostly use the defaults.

MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="400M" (I tried higher values and got longer GC pauses: 1600 ms
instead of the < 200 ms I see now with 400M; see the sketch after the flag
list for what the stock script would pick automatically on this box)

-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly
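
If I remember the stock cassandra-env.sh correctly, when HEAP_NEWSIZE is left
unset it picks roughly min(100 MB per core, 1/4 of the max heap); treat that
formula as an assumption, but here is what it would give on this box:

# Rough reconstruction (an assumption, not the script itself) of how the stock
# cassandra-env.sh sizes the young generation when HEAP_NEWSIZE is not set.
cores = 16                # hi1.4xLarge
max_heap_mb = 8 * 1024    # MAX_HEAP_SIZE="8G"

auto_newsize_mb = min(100 * cores, max_heap_mb // 4)
print("auto HEAP_NEWSIZE would be ~%dM" % auto_newsize_mb)
# -> ~1600M here, i.e. four times the 400M that gives us the best GC pauses.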

Does this configuration seem coherent? Right now performance is fine, with
latency < 5 ms almost all the time. What can I do to handle more data per
node and keep this level of performance, or even improve it?

I know this is a long message, but if you have any comment or insight, even
on part of it, don't hesitate to share it. I guess this kind of feedback on
configuration is useful to the entire community.

Alain

Re: High performance hardware with lot of data per node - Global learning about configuration

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
This comment and some testing were enough for us.

"Generally, a value between 128 and 512 here coupled with a large key cache
size on CFs results in the best trade offs.  This value is not often
changed, however if you have many very small rows (many to an OS page),
then increasing this will often lower memory usage without an impact on
performance."

And indeed, I started using this config on only one node without seeing any
performance degradation. Mean read latency was around 4 ms on all the
servers, including this one. And I had no more full heaps: heap usage now
grows slowly from 2.5 GB to 5.5 GB instead of getting stuck between 5.0 GB
and 6.5 GB (out of an 8 GB heap).

All the graphs I could compare while running both configurations (128/512)
on different servers were almost identical, except for the heap.

So 512 was a lot better in our case.

Hope this helps you, since that was also the purpose of this thread.

Alain






2013/7/9 Mike Heffner <mi...@librato.com>

> I'm curious because we are experimenting with a very similar
> configuration: what basis did you use for increasing index_interval to
> that value? Do you have before-and-after numbers, or was it simply the
> reduction of heap-pressure warnings that you were looking for?
>
> thanks,
>
> Mike
>

Re: High performance hardware with lot of data per node - Global learning about configuration

Posted by Mike Heffner <mi...@librato.com>.
I'm curious because we are experimenting with a very similar configuration:
what basis did you use for increasing index_interval to that value? Do you
have before-and-after numbers, or was it simply the reduction of
heap-pressure warnings that you were looking for?

thanks,

Mike


On Tue, Jul 9, 2013 at 10:11 AM, Alain RODRIGUEZ <ar...@gmail.com> wrote:



-- 

  Mike Heffner <mi...@librato.com>
  Librato, Inc.

Re: High performance hardware with lot of data per node - Global learning about configuration

Posted by Mike Heffner <mi...@librato.com>.
Aiman,

I believe that is one of the cases we added a check for:

https://github.com/librato/tablesnap/blob/master/tablesnap#L203-L207


Mike


On Thu, Jul 11, 2013 at 1:54 PM, Aiman Parvaiz <ai...@grapheffect.com> wrote:



-- 

  Mike Heffner <mi...@librato.com>
  Librato, Inc.

Re: High performance hardware with lot of data per node - Global learning about configuration

Posted by Aiman Parvaiz <ai...@grapheffect.com>.
Thanks for the info, Mike. We ran into a race condition that was killing tablesnap; I want to share the problem and the solution/workaround, and maybe someone can shed some light on the effects of the solution.

tablesnap was getting killed with this error message:

"Failed uploading %s. Aborting.\n%s"

Looking at the code took me to the following:

def worker(self):
        bucket = self.get_bucket()

        while True:
            f = self.fileq.get()
            keyname = self.build_keyname(f)
            try:
                self.upload_sstable(bucket, keyname, f)
            except:
                self.log.critical("Failed uploading %s. Aborting.\n%s" %
                             (f, format_exc()))
                # Brute force kill self
                os.kill(os.getpid(), signal.SIGKILL)

            self.fileq.task_done()

It builds the filename and then, before it can upload it, the file disappears (which is possible). I simply commented out the line that kills tablesnap if the file is not found; that fixes the issue we were having, but I would appreciate it if someone has any insight into ill effects this might have on the backup or restore process. A gentler variant I have in mind is sketched below.
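
Rather than dropping the kill entirely, something like this (an untested
sketch, not the actual tablesnap or librato patch; the function and parameter
names are mine) would only skip files that have really vanished and still die
on genuine upload failures:

import os
import signal
from traceback import format_exc

def upload_or_skip(uploader, bucket, keyname, path, log):
    # 'uploader' is whatever object exposes upload_sstable(), as in the
    # worker() excerpt above; everything else here is illustrative.
    try:
        uploader.upload_sstable(bucket, keyname, path)
    except Exception:
        if not os.path.exists(path):
            # The SSTable was compacted away between queueing and upload.
            log.info("Skipping %s, file no longer exists" % path)
            return
        log.critical("Failed uploading %s. Aborting.\n%s" % (path, format_exc()))
        os.kill(os.getpid(), signal.SIGKILL)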

Thanks


On Jul 11, 2013, at 7:03 AM, Mike Heffner <mi...@librato.com> wrote:



Re: High performance hardware with lot of data per node - Global learning about configuration

Posted by Mike Heffner <mi...@librato.com>.
We've also noticed very good read and write latencies with the hi1.4xls
compared to our previous instance classes. We actually ran a mixed cluster
of hi1.4xls and m2.4xls to get a side-by-side comparison.

Despite the significant improvement in underlying hardware, we've noticed
that streaming performance with 1.2.6+vnodes is a lot slower than we would
expect. Bootstrapping a node into a ring with large storage loads can take
6+ hours. We have a JIRA open that describes our current config:
https://issues.apache.org/jira/browse/CASSANDRA-5726

Aiman: We also use tablesnap for our backups. We're using a slightly
modified version [1]. We currently back up every SSTable as soon as it hits
disk (tablesnap's inotify), but we're considering moving to a periodic
snapshot approach, as the SSTable churn after going from 24 nodes to 6 nodes
is quite high.

Mike


[1]: https://github.com/librato/tablesnap


On Thu, Jul 11, 2013 at 7:33 AM, Aiman Parvaiz <ai...@grapheffect.com> wrote:



-- 

  Mike Heffner <mi...@librato.com>
  Librato, Inc.

Re: High performance hardware with lot of data per node - Global learning about configuration

Posted by Aiman Parvaiz <ai...@grapheffect.com>.
Hi,
We also recently migrated to 3 hi1.4xlarge boxes (RAID 0 SSD), and the disk IO performance is definitely better than on the earlier non-SSD servers; we are serving up to 14k reads/s with a latency of 3-3.5 ms/op.
I wanted to share our config options and ask about the data backup strategy for RAID 0.

We are using C* 1.2.6 with

a key_cache and row_cache of 300 MB.
I have not changed or modified any other parameter, except for going with multithreaded GC. I will be playing around with other factors and will update everyone if I find something interesting.

Also, I just wanted to share our backup strategy and see if I can get something useful from how others are backing up their RAID 0. I am using tablesnap to upload SSTables to S3, and I have attached a separate EBS volume to every box and set up rsync to mirror the Cassandra data from RAID 0 to EBS. I would really appreciate it if you guys can share how you are taking backups.
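
In case it helps the discussion, the general shape of a periodic snapshot-plus-rsync job is something like the sketch below (untested as written; the paths are made up for illustration, and the snapshot is only there so the copied SSTables are consistent):

# Untested sketch of a periodic "snapshot then mirror to EBS" job.
# Paths and the tag format are made up for illustration only.
import datetime
import subprocess

DATA_DIR = "/raid0/cassandra/data"     # hypothetical RAID 0 data directory
EBS_DIR = "/ebs/cassandra-backup"      # hypothetical EBS mount point

tag = datetime.datetime.utcnow().strftime("mirror-%Y%m%d%H%M")

# Flush memtables and hard-link a snapshot of every keyspace.
subprocess.check_call(["nodetool", "snapshot", "-t", tag])

# Mirror the data directory (snapshots included) onto the EBS volume.
subprocess.check_call(["rsync", "-a", "--delete", DATA_DIR + "/", EBS_DIR + "/"])

# Clear snapshots afterwards so the hard links do not pin old SSTables.
subprocess.check_call(["nodetool", "clearsnapshot"])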

Thanks 


On Jul 9, 2013, at 7:11 AM, Alain RODRIGUEZ <ar...@gmail.com> wrote:
