Posted to user@cassandra.apache.org by Ran Tavory <ra...@gmail.com> on 2010/05/04 08:20:47 UTC

performance tuning - where does the slowness come from?

I'm looking into performance issues on a 0.6.1 cluster. I see two symptoms:
1. Reads and writes are slow
2. One of the hosts is doing a lot of GC.

1 is slow in the sense that in its normal state the cluster used to make around
3-5k reads and writes per second (6-10k operations per second), but now it's
in the order of 200-400 ops per second, sometimes even less.
2 looks like this:
$ tail -f /outbrain/cassandra/log/system.log
 INFO [GC inspection] 2010-05-04 00:42:18,636 GCInspector.java (line 110) GC
for ParNew: 672 ms, 166482384 reclaimed leaving 2872087208 used; max is
4432068608
 INFO [GC inspection] 2010-05-04 00:42:28,638 GCInspector.java (line 110) GC
for ParNew: 498 ms, 166493352 reclaimed leaving 2836049448 used; max is
4432068608
 INFO [GC inspection] 2010-05-04 00:42:38,640 GCInspector.java (line 110) GC
for ParNew: 327 ms, 166091528 reclaimed leaving 2796888424 used; max is
4432068608
... and it goes on and on for hours, no stopping...

The cluster is made of 6 hosts, 3 in one DC and 3 in another.
Each host has 8G RAM.
-Xmx=4G

For some reason the load isn't distributed evenly b/w the hosts, although
I'm not sure this is the cause of the slowness:
$ nodetool -h localhost -p 9004 ring
Address        Status     Load          Range                                       Ring
                                        144413773383729447702215082383444206680
192.168.252.99 Up         15.94 GB      66002764663998929243644931915471302076      |<--|
192.168.254.57 Up         19.84 GB      81288739225600737067856268063987022738      |   ^
192.168.254.58 Up         973.78 MB     86999744104066390588161689990810839743      v   |
192.168.252.62 Up         5.18 GB       88308919879653155454332084719458267849      |   ^
192.168.254.59 Up         10.57 GB      142482163220375328195837946953175033937     v   |
192.168.252.61 Up         11.36 GB      144413773383729447702215082383444206680     |-->|

The slow host is 192.168.252.61 and it isn't the most loaded one.
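
For reference, evenly balanced InitialTokens for the RandomPartitioner can be
computed as i * 2**127 / N (the formula suggested later in this thread) and
applied with nodetool move; a minimal sketch in Python, assuming 6 nodes:

# sketch: evenly spaced RandomPartitioner tokens, i * 2**127 / N
NUM_NODES = 6

def initial_tokens(n):
    return [i * (2 ** 127) // n for i in range(n)]

for i, token in enumerate(initial_tokens(NUM_NODES)):
    print("node %d: %d" % (i, token))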

The host is waiting a lot on IO and the load average is usually 6-7
$ w
 00:42:56 up 11 days, 13:22,  1 user,  load average: 6.21, 5.52, 3.93

$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  8 2147844  45744   1816 4457384    6    5    66    32    5    2  1  1 96  2  0
 0  8 2147164  49020   1808 4451596  385    0  2345    58 3372 9957  2  2 78 18  0
 0  3 2146432  45704   1812 4453956  342    0  2274   108 3937 10732  2  2 78 19  0
 0  1 2146252  44696   1804 4453436  345  164  1939   294 3647 7833  2  2 78 18  0
 0  1 2145960  46924   1744 4451260  158    0  2423   122 4354 14597  2  2 77 18  0
 7  1 2138344  44676    952 4504148 1722  403  1722   406 1388  439 87  0 10  2  0
 7  2 2137248  45652    956 4499436 1384  655  1384   658 1356  392 87  0 10  3  0
 7  1 2135976  46764    956 4495020 1366  718  1366   718 1395  380 87  0  9  4  0
 0  8 2134484  46964    956 4489420 1673  555  1814   586 1601 215590 14  2 68 16  0
 0  1 2135388  47444    972 4488516  785  833  2390   995 3812 8305  2  2 77 20  0
 0 10 2135164  45928    980 4488796  788  543  2275   626 36

So, the host is swapping like crazy...

top shows that it's using a lot of memory. As noted before, -Xmx=4G, and
nothing else on the host seems to be using much memory except the cassandra
process; yet of the 8G RAM on the host, 92% is used by cassandra. How's that?
Top shows 3.9g shared, 7.2g resident and *15.9g virtual*. Why does it have
15g virtual? And why 7.2g resident? That would explain the swapping, and the
slowness along with it.

$ top
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND


20281 cassandr  25   0 15.9g 7.2g 3.9g S 33.3 92.6 175:30.27 java

So, can the total memory be controlled?
Or perhaps I'm looking in the wrong direction...
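
One possibility raised later in the thread: the memory above the 4G heap is
mmapped SSTable data, which 0.6 controls via DiskAccessMode in
storage-conf.xml. A minimal sketch of that element (values, as far as I know,
are auto, mmap, mmap_index_only and standard):

<!-- sketch only: "standard" avoids mmap entirely, "mmap_index_only"
     maps just the index files; the shipped default is "auto" -->
<DiskAccessMode>mmap_index_only</DiskAccessMode>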

I've looked at all the cassandra JMX counters and nothing seemed suspicious so
far. By suspicious I mean a large number of pending tasks; the numbers in each
pool were always very small.
About read and write latencies, I'm not sure what the normal state is, but
here's an example of what I see on the problematic host:

#mbean = org.apache.cassandra.service:type=StorageProxy:
RecentReadLatencyMicros = 30105.888180684495;
TotalReadLatencyMicros = 78543052801;
TotalWriteLatencyMicros = 4213118609;
RecentWriteLatencyMicros = 1444.4809201925639;
ReadOperations = 4779553;
RangeOperations = 0;
TotalRangeLatencyMicros = 0;
RecentRangeLatencyMicros = NaN;
WriteOperations = 4740093;

And the only pool in which I do see some pending tasks is ROW-READ-STAGE, but
it doesn't look like much, usually around 6-8:
#mbean = org.apache.cassandra.concurrent:type=ROW-READ-STAGE:
ActiveCount = 8;
PendingTasks = 8;
CompletedTasks = 5427955;

Any help finding the solution is appreciated, thanks...

Below are a few more JMXes I collected from the system that may be
interesting.

#mbean = java.lang:type=Memory:
Verbose = false;

HeapMemoryUsage = {
  committed = 3767279616;
  init = 134217728;
  max = 4293656576;
  used = 1237105080;
 };

NonHeapMemoryUsage = {
  committed = 35061760;
  init = 24313856;
  max = 138412032;
  used = 23151320;
 };

ObjectPendingFinalizationCount = 0;

#mbean = java.lang:name=ParNew,type=GarbageCollector:
LastGcInfo = {
  GcThreadCount = 11;
  duration = 136;
  endTime = 42219272;
  id = 11719;
  memoryUsageAfterGc = {
    ( CMS Perm Gen ) = {
      key = CMS Perm Gen;
      value = {
        committed = 29229056;
        init = 21757952;
        max = 88080384;
        used = 17648848;
       };
     };
    ( Code Cache ) = {
      key = Code Cache;
      value = {
        committed = 5832704;
        init = 2555904;
        max = 50331648;
        used = 5563520;
       };
     };
    ( CMS Old Gen ) = {
      key = CMS Old Gen;
      value = {
        committed = 3594133504;
        init = 112459776;
        max = 4120510464;
        used = 964565720;
       };
     };
    ( Par Eden Space ) = {
      key = Par Eden Space;
      value = {
        committed = 171835392;
        init = 21495808;
        max = 171835392;
        used = 0;
       };
     };
    ( Par Survivor Space ) = {
      key = Par Survivor Space;
      value = {
        committed = 1310720;
        init = 131072;
        max = 1310720;
        used = 0;
       };
     };
   };
  memoryUsageBeforeGc = {
    ( CMS Perm Gen ) = {
      key = CMS Perm Gen;
      value = {
        committed = 29229056;
        init = 21757952;
        max = 88080384;
        used = 17648848;
       };
     };
    ( Code Cache ) = {
      key = Code Cache;
      value = {
        committed = 5832704;
        init = 2555904;
        max = 50331648;
        used = 5563520;
       };
     };
    ( CMS Old Gen ) = {
      key = CMS Old Gen;
      value = {
        committed = 3594133504;
        init = 112459776;
        max = 4120510464;
        used = 959221872;
       };
     };
    ( Par Eden Space ) = {
      key = Par Eden Space;
      value = {
        committed = 171835392;
        init = 21495808;
        max = 171835392;
        used = 171835392;
       };
     };
    ( Par Survivor Space ) = {
      key = Par Survivor Space;
      value = {
        committed = 1310720;
        init = 131072;
        max = 1310720;
        used = 0;
       };
     };
   };
  startTime = 42219136;
 };
CollectionCount = 11720;
CollectionTime = 4561730;
Name = ParNew;
Valid = true;
MemoryPoolNames = [ Par Eden Space, Par Survivor Space ];

#mbean = java.lang:type=OperatingSystem:
MaxFileDescriptorCount = 63536;
OpenFileDescriptorCount = 75;
CommittedVirtualMemorySize = 17787711488;
FreePhysicalMemorySize = 45522944;
FreeSwapSpaceSize = 2123968512;
ProcessCpuTime = 12251460000000;
TotalPhysicalMemorySize = 8364417024;
TotalSwapSpaceSize = 4294959104;
Name = Linux;
AvailableProcessors = 8;
Arch = amd64;
SystemLoadAverage = 4.36;
Version = 2.6.18-164.15.1.el5;

#mbean = java.lang:type=Runtime:
Name = 20281@ob1061.nydc1.outbrain.com;

ClassPath =
/outbrain/cassandra/apache-cassandra-0.6.1/bin/../conf:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../build/classes:/outbrain/cassandra/apache-cassandra-0.6.1/bin/..
/lib/antlr-3.1.3.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/apache-cassandra-0.6.1.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/avro-1.2.0-dev.jar:/outb
rain/cassandra/apache-cassandra-0.6.1/bin/../lib/clhm-production.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/commons-cli-1.1.jar:/outbrain/cassandra/apache-cassandra-
0.6.1/bin/../lib/commons-codec-1.2.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/commons-collections-3.2.1.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/com
mons-lang-2.4.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/google-collections-1.0.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/hadoop-core-0.20.1.jar:/out
brain/cassandra/apache-cassandra-0.6.1/bin/../lib/high-scale-lib.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/ivy-2.1.0.jar:/outbrain/cassandra/apache-cassandra-0.6.1/
bin/../lib/jackson-core-asl-1.4.0.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/jackson-mapper-asl-1.4.0.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/jline
-0.9.94.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/json-simple-1.1.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/libthrift-r917130.jar:/outbrain/cassandr
a/apache-cassandra-0.6.1/bin/../lib/log4j-1.2.14.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/slf4j-api-1.5.8.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib
/slf4j-log4j12-1.5.8.jar;

BootClassPath =
/usr/java/jdk1.6.0_17/jre/lib/alt-rt.jar:/usr/java/jdk1.6.0_17/jre/lib/resources.jar:/usr/java/jdk1.6.0_17/jre/lib/rt.jar:/usr/java/jdk1.6.0_17/jre/lib/sunrsasign.j
ar:/usr/java/jdk1.6.0_17/jre/lib/jsse.jar:/usr/java/jdk1.6.0_17/jre/lib/jce.jar:/usr/java/jdk1.6.0_17/jre/lib/charsets.jar:/usr/java/jdk1.6.0_17/jre/classes;

LibraryPath =
/usr/java/jdk1.6.0_17/jre/lib/amd64/server:/usr/java/jdk1.6.0_17/jre/lib/amd64:/usr/java/jdk1.6.0_17/jre/../lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib;

VmName = Java HotSpot(TM) 64-Bit Server VM;

VmVendor = Sun Microsystems Inc.;

VmVersion = 14.3-b01;

BootClassPathSupported = true;

InputArguments = [ -ea, -Xms128M, -Xmx4G, -XX:TargetSurvivorRatio=90,
-XX:+AggressiveOpts, -XX:+UseParNewGC, -XX:+UseConcMarkSweepGC,
-XX:+CMSParallelRemarkEnabled, -XX:+HeapDumpOnOutOfMemoryError,
-XX:SurvivorRatio=128, -XX:MaxTenuringThreshold=0,
-Dcom.sun.management.jmxremote.port=9004,
-Dcom.sun.management.jmxremote.ssl=false,
-Dcom.sun.management.jmxremote.authenticate=false,
-Dstorage-config=/outbrain/cassandra/apache-cassandra-0.6.1/bin/../conf,
-Dcassandra-pidfile=/var/run/cassandra.pid ];

ManagementSpecVersion = 1.2;

SpecName = Java Virtual Machine Specification;

SpecVendor = Sun Microsystems Inc.;

SpecVersion = 1.0;

StartTime = 1272911001415;
...

Re: performance tuning - where does the slowness come from?

Posted by Ran Tavory <ra...@gmail.com>.
I haven't tried it.

On May 6, 2010 1:22 AM, "Mark Greene" <gr...@gmail.com> wrote:

Ran,

Did you find differing results from stress.py?

-Mark



On Wed, May 5, 2010 at 5:59 PM, Ran Tavory <ra...@gmail.com> wrote:
>
> let's see if I can make s...

Re: performance tuning - where does the slowness come from?

Posted by Mark Greene <gr...@gmail.com>.
Ran,

Did you find differing results from stress.py?

-Mark

On Wed, May 5, 2010 at 5:59 PM, Ran Tavory <ra...@gmail.com> wrote:

> let's see if I can make some assertions, feel free to correct me...
>
> Well, obviously, reads are much slower in cassandra than writes, everyone
> knows that, but by which factor?
> In my case I read/write only one column at a time. Key, column and value
> are pretty small (< 200b)
> So the numbers are usually - Write Latency: ~0.05ms, Read Latency: ~30ms.
> So it looks like reads are 60x slower, at least on my hardware. This happens
> when cache is cold. If cache is warm reads are better, but unfortunately my
> cache is usually cold...
> If my application keeps reading a cold cache there's nothing I can do to
> make reads faster from cassandra's side. With one client thread this implies
> 1000/30=33 reads/sec, not great.
> However, although read latency is a bottleneck, read throughput isn't. So I
> need to add more reading threads, and I can actually add many of them before
> read latency starts degrading; according to ycsb I can have ~10000 reads/sec
> (on the cluster they tested, numbers may vary) before read latency starts to
> degrade. So, numbers may vary by cluster size, hardware, data size etc, but
> the idea is - if read latency is 30ms and the cache is a miss most of the time,
> that's normal, just add more reader threads and you get better throughput.
> Sorry if this sounds trivial, I was just trying to improve on the 30ms
> reads until I realized I actually can't...
>
> On Wed, May 5, 2010 at 7:08 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>
>>  - your key cache isn't warm.  capacity 17M, size 0.5M, 468083 reads
>> sounds like most of your reads have been for unique keys.
>>  - the kind of reads you are doing can have a big effect (mostly
>> number of columns you are asking for).  column index granularity plays
>> a role (for non-rowcached reads); so can column comparator (see e.g.
>> https://issues.apache.org/jira/browse/CASSANDRA-1043)
>>  - the slow system reads are all on HH rows, which can get very wide
>> (hence, slow to read the whole row, which is what the HH code does).
>> clean those out either by bringing back the nodes it's hinting for, or
>> just removing the HH data files.
>>
>> On Wed, May 5, 2010 at 10:19 AM, Ran Tavory <ra...@gmail.com> wrote:
>> > I'm still trying to figure out where my slowness is coming from...
>> > By now I'm pretty sure it's the reads are slow, but not sure how to
>> improve
>> > them.
>> > I'm looking at cfstats. Can you say if there are better configuration
>> > options? So far I've used all default settings, except for:
>> >     <Keyspace Name="outbrain_kvdb">
>> >       <ColumnFamily CompareWith="BytesType" Name="KvImpressions"
>> > KeysCached="50%"/>
>> >
>> >
>>  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackAwareStrategy</ReplicaPlacementStrategy>
>> >       <ReplicationFactor>2</ReplicationFactor>
>> >
>> >
>>  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
>> >     </Keyspace>
>> >
>> > What does a good read latency look like? I was expecting 10ms, however
>> so
>> > far it seems that my KvImpressions read latency is 30ms and in the
>> system
>> > keyspace I have 800ms :(
>> > I thought adding KeysCached="50%" would improve my situation but
>> > unfortunately looks like the hitrate is about 0. I realize that's
>> > application specific, but maybe there are other magic bullets...
>> > Is there something like adding cache to the system keyspace? 800 ms is
>> > pretty bad, isn't it?
>> > See stats below and thanks.
>> >
>> > Keyspace: outbrain_kvdb
>> >         Read Count: 651668
>> >         Read Latency: 34.18622328547666 ms.
>> >         Write Count: 655542
>> >         Write Latency: 0.041145092152752985 ms.
>> >         Pending Tasks: 0
>> >                 Column Family: KvImpressions
>> >                 SSTable count: 13
>> >                 Space used (live): 23304548897
>> >                 Space used (total): 23304548897
>> >                 Memtable Columns Count: 895
>> >                 Memtable Data Size: 2108990
>> >                 Memtable Switch Count: 8
>> >                 Read Count: 468083
>> >                 Read Latency: 151.603 ms.
>> >                 Write Count: 552566
>> >                 Write Latency: 0.023 ms.
>> >                 Pending Tasks: 0
>> >                 Key cache capacity: 17398656
>> >                 Key cache size: 567967
>> >                 Key cache hit rate: 0.0
>> >                 Row cache: disabled
>> >                 Compacted row minimum size: 269
>> >                 Compacted row maximum size: 54501
>> >                 Compacted row mean size: 933
>> > ...
>> > ----------------
>> > Keyspace: system
>> >         Read Count: 1151
>> >         Read Latency: 872.5014448305822 ms.
>> >         Write Count: 51215
>> >         Write Latency: 0.07156788050375866 ms.
>> >         Pending Tasks: 0
>> >                 Column Family: HintsColumnFamily
>> >                 SSTable count: 5
>> >                 Space used (live): 437366878
>> >                 Space used (total): 437366878
>> >                 Memtable Columns Count: 14987
>> >                 Memtable Data Size: 87975
>> >                 Memtable Switch Count: 2
>> >                 Read Count: 1150
>> >                 Read Latency: NaN ms.
>> >                 Write Count: 51211
>> >                 Write Latency: 0.027 ms.
>> >                 Pending Tasks: 0
>> >                 Key cache capacity: 6
>> >                 Key cache size: 4
>> >                 Key cache hit rate: NaN
>> >                 Row cache: disabled
>> >                 Compacted row minimum size: 0
>> >                 Compacted row maximum size: 0
>> >                 Compacted row mean size: 0
>> >                 Column Family: LocationInfo
>> >                 SSTable count: 2
>> >                 Space used (live): 3504
>> >                 Space used (total): 3504
>> >                 Memtable Columns Count: 0
>> >                 Memtable Data Size: 0
>> >                 Memtable Switch Count: 1
>> >                 Read Count: 1
>> >                 Read Latency: NaN ms.
>> >                 Write Count: 7
>> >                 Write Latency: NaN ms.
>> >                 Pending Tasks: 0
>> >                 Key cache capacity: 2
>> >                 Key cache size: 1
>> >                 Key cache hit rate: NaN
>> >                 Row cache: disabled
>> >                 Compacted row minimum size: 0
>> >                 Compacted row maximum size: 0
>> >                 Compacted row mean size: 0
>> >
>> > On Tue, May 4, 2010 at 10:57 PM, Kyusik Chung <ky...@discovereads.com>
>> > wrote:
>> >>
>> >> Im using Ubuntu 8.04 on 64 bit hosts on rackspace cloud.
>> >>
>> >> Im in the middle of repeating some perf tests, but so far, I get
>> as-good
>> >> or slightly better read perf by using standard disk access mode vs
>> mmap.  So
>> >> far consecutive tests are returning consistent numbers.
>> >>
>> >> Im not sure how to explain it...maybe its an ubuntu 8.04 issue with
>> mmap.
>> >>  Back when I was using mmap, I was definitely seeing the kswapd0
>> process
>> >> start using cpu as the box ran out of memory, and read performance
>> >> significantly degraded.
>> >>
>> >> Next, Ill run some tests with mmap_index_only, and Ill test with heavy
>> >> concurrent writes as well as reads.  Ill let everyone know what I find.
>> >>
>> >> Kyusik Chung
>> >> CEO, Discovereads.com
>> >> kyusik@discovereads.com
>> >>
>> >> On May 4, 2010, at 12:27 PM, Jonathan Ellis wrote:
>> >>
>> >> > Are you using 32 bit hosts?  If not don't be scared of mmap using a
>> >> > lot of address space, you have plenty.  It won't make you swap more
>> >> > than using buffered i/o.
>> >> >
>> >> > On Tue, May 4, 2010 at 1:57 PM, Ran Tavory <ra...@gmail.com> wrote:
>> >> >> I canceled mmap and indeed memory usage is sane again. So far
>> >> >> performance
>> >> >> hasn't been great, but I'll wait and see.
>> >> >> I'm also interested in a way to cap mmap so I can take advantage of
>> it
>> >> >> but
>> >> >> not swap the host to death...
>> >> >>
>> >> >> On Tue, May 4, 2010 at 9:38 PM, Kyusik Chung <
>> kyusik@discovereads.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> This sounds just like the slowness I was asking about in another
>> >> >>> thread -
>> >> >>> after a lot of reads, the machine uses up all available memory on
>> the
>> >> >>> box
>> >> >>> and then starts swapping.
>> >> >>> My understanding was that mmap helps greatly with read and write
>> perf
>> >> >>> (until the box starts swapping I guess)...is there any way to use
>> mmap
>> >> >>> and
>> >> >>> cap how much memory it takes up?
>> >> >>> What do people use in production?  mmap or no mmap?
>> >> >>> Thanks!
>> >> >>> Kyusik Chung
>> >> >>> On May 4, 2010, at 10:11 AM, Schubert Zhang wrote:
>> >> >>>
>> >> >>> 1. When initially startup your nodes, please plan your InitialToken
>> of
>> >> >>> each node evenly.
>> >> >>> 2. <DiskAccessMode>standard</DiskAccessMode>
>> >> >>>
>> >> >>> On Tue, May 4, 2010 at 9:09 PM, Boris Shulman <sh...@gmail.com>
>> >> >>> wrote:
>> >> >>>>
>> >> >>>> I think that the extra (more than 4GB) memory usage comes from the
>> >> >>>> mmaped io, that is why it happens only for reads.
>> >> >>>>
>> >> >>>> On Tue, May 4, 2010 at 2:02 PM, Jordan Pittier
>> >> >>>> <jo...@gmail.com>
>> >> >>>> wrote:
>> >> >>>>> I'm facing the same issue with swap. It only occurs when I
>> perform
>> >> >>>>> read
>> >> >>>>> operations (write are very fast :)). So I can't help you with the
>> >> >>>>> memory
>> >> >>>>> probleme.
>> >> >>>>>
>> >> >>>>> But to balance the load evenly between nodes in cluster just
>> >> >>>>> manually
>> >> >>>>> fix
>> >> >>>>> their token.(the "formula" is i * 2^127 / nb_nodes).
>> >> >>>>>
>> >> >>>>> Jordzn
>> >> >>>>>
>> >> >>>>>> On Tue, May 4, 2010 at 8:20 AM, Ran Tavory <ra...@gmail.com> wrote:
>> >> >>>>>>
>> >> >>>>>> I'm looking into performance issues on a 0.6.1 cluster. I see two
>> >> >>>>>> symptoms:
>> >> >>>>>> 1. Reads and writes are slow
>> >> >>>>>> 2. One of the hosts is doing a lot of GC.
>> >> >>>>>> ...
>> >> >>>>>
>> >> >>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Jonathan Ellis
>> >> > Project Chair, Apache Cassandra
>> >> > co-founder of Riptano, the source for professional Cassandra support
>> >> > http://riptano.com
>> >>
>> >
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>>
>
>

Re: performance tuning - where does the slowness come from?

Posted by Jonathan Ellis <jb...@gmail.com>.
We're working on better GC defaults for 0.6.2.  Thanks!

On Tue, May 11, 2010 at 12:00 PM, B. Todd Burruss <bb...@real.com> wrote:
> another note on this ... since all my nodes are very well balanced and were
> started at the same time, i notice that they all do garbage collection at
> about the same time.  this of course causes a performance issue.
>
> i also have noticed that with the default JVM options and heavy load,
> ConcMarkSweepGC can fall behind and require the JVM to unexpectedly pause
> while it plays catchup.  adding the following param can help this out, it
> says to start processing when "CMS Old Gen" memory is 88% used.
>
> -XX:CMSInitiatingOccupancyFraction=88
>
> from my understanding of how the default is calculated, mine was about 92%, so
> i only lowered it by 4%, but now i can see GC starting earlier and haven't had
> a pause like i saw before.
>
>
> On 05/06/2010 02:42 PM, Todd Burruss wrote:
>
> i think you will see a slow down because of large values in your columns.
> make sure you take a look at MemtableThroughputInMB in your config.  if you
> are writing 1MB of data per row, then you'll probably want to increase this
> quite a bit so you are not constantly creating sstables.  can't recall, did
> you see compaction mgr reporting a lot of pending compactions?  maybe try to
> "chunk" your data into multiple columns or multiple rows.
>
> i too see slowness that exhibits in the same manner as you guys have
> described.  i'm still trying to track it down as well.
>
> On 05/06/2010 10:56 AM, Ran Tavory wrote:
>
> Jonathan, I think it's the case of large values in the columns. The
> problematic CF is a key-value store, so it has only one column per row,
> however the value of that column can be large. It's a java serialized object
> (uncompressed) which, may be 100s of bytes, maybe even a few megs. This CF
> also suffers from zero cache hits since each time a read is for a unique
> key.
> I ran stress.py and I see much better results (reads are < 1ms) so I assume
> my cluster is healthy, so I need to fix the app. Would 1meg bytes object
> explain a 30ms (sometimes even more) read latency? The boxes aren't fancy,
> not sure exactly what hardware we have there but it's "commodity"...
> Thanks!
>
> On Thu, May 6, 2010 at 5:22 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> columns, not CFs.
>>
>> put another way, how wide are the rows in the slow CF?
>>
>> On Wed, May 5, 2010 at 11:30 PM, Ran Tavory <ra...@gmail.com> wrote:
>> > I have a few CFs but the one I'm seeing slowness in, which is the one
>> > with
>> > plenty of cache misses has only one column per key.
>> > Latency varies b/w 10ms and 60ms but I'd say average is 30ms.
>> >
>> > On Thu, May 6, 2010 at 4:25 AM, Jonathan Ellis <jb...@gmail.com>
>> > wrote:
>> >>
>> >> How many columns are in the rows you are reading from?
>> >>
>> >> 30ms is quite high, so I suspect you have relatively large rows, in
>> >> which case decreasing the column index threshold may help.
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: performance tuning - where does the slowness come from?

Posted by "B. Todd Burruss" <bb...@real.com>.
another note on this ... since all my nodes are very well balanced and 
were started at the same time, i notice that they all do garbage 
collection at about the same time.  this of course causes a performance 
issue.

i also have noticed that with the default JVM options and heavy load, 
ConcMarkSweepGC can fall behind and require the JVM to unexpectedly 
pause while it plays catchup.  adding the following param can help this 
out, it says to start processing when "CMS Old Gen" memory is 88% used.

-XX:CMSInitiatingOccupancyFraction=88

from my understanding of how the default is calculated, mine was about 92%, 
so i only lowered it by 4%, but now i can see GC starting earlier and 
haven't had a pause like i saw before.
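
a sketch of where the flag goes, assuming the stock 0.6 bin/cassandra.in.sh
(the script that produces the InputArguments listed earlier in this thread):

# bin/cassandra.in.sh (sketch; exact variable layout may differ per install)
JVM_OPTS="$JVM_OPTS \
        -XX:CMSInitiatingOccupancyFraction=88 \
        -XX:+UseCMSInitiatingOccupancyOnly"
# the second flag is optional: it makes the JVM honor the fraction on every
# cycle instead of treating it as a hint; both are standard HotSpot flags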


On 05/06/2010 02:42 PM, Todd Burruss wrote:
> i think you will see a slow down because of large values in your 
> columns.  make sure you take a look at MemtableThroughputInMB in your 
> config.  if you are writing 1MB of data per row, then you'll probably 
> want to increase this quite a bit so you are not constantly creating 
> sstables.  can't recall, did you see compaction mgr reporting a lot of 
> pending compactions?  maybe try to "chunk" your data into multiple 
> columns or multiple rows.
>
> i too see slowness that exhibits in the same manner as you guys have 
> described.  i'm still trying to track it down as well.
>
> On 05/06/2010 10:56 AM, Ran Tavory wrote:
>> Jonathan, I think it's the case of large values in the columns. The 
>> problematic CF is a key-value store, so it has only one column per 
>> row, however the value of that column can be large. It's a java 
>> serialized object (uncompressed) which, may be 100s of bytes, maybe 
>> even a few megs. This CF also suffers from zero cache hits since each 
>> time a read is for a unique key.
>>
>> I ran stress.py and I see much better results (reads are < 1ms) so I 
>> assume my cluster is healthy, so I need to fix the app. Would 1meg 
>> bytes object explain a 30ms (sometimes even more) read latency? The 
>> boxes aren't fancy, not sure exactly what hardware we have there but 
>> it's "commodity"...
>>
>> Thanks!
>>
>> On Thu, May 6, 2010 at 5:22 PM, Jonathan Ellis <jbellis@gmail.com 
>> <ma...@gmail.com>> wrote:
>>
>>     columns, not CFs.
>>
>>     put another way, how wide are the rows in the slow CF?
>>
>>     On Wed, May 5, 2010 at 11:30 PM, Ran Tavory <rantav@gmail.com
>>     <ma...@gmail.com>> wrote:
>>     > I have a few CFs but the one I'm seeing slowness in, which is
>>     the one with
>>     > plenty of cache misses has only one column per key.
>>     > Latency varies b/w 10ms and 60ms but I'd say average is 30ms.
>>     >
>>     > On Thu, May 6, 2010 at 4:25 AM, Jonathan Ellis
>>     <jbellis@gmail.com <ma...@gmail.com>> wrote:
>>     >>
>>     >> How many columns are in the rows you are reading from?
>>     >>
>>     >> 30ms is quite high, so I suspect you have relatively large
>>     rows, in
>>     >> which case decreasing the column index threshold may help.
>>
>>     --
>>     Jonathan Ellis
>>     Project Chair, Apache Cassandra
>>     co-founder of Riptano, the source for professional Cassandra support
>>     http://riptano.com
>>
>>
>

Re: performance tuning - where does the slowness come from?

Posted by "B. Todd Burruss" <bb...@real.com>.
i think you will see a slow down because of large values in your 
columns.  make sure you take a look at MemtableThroughputInMB in your 
config.  if you are writing 1MB of data per row, then you'll probably 
want to increase this quite a bit so you are not constantly creating 
sstables.  can't recall, did you see compaction mgr reporting a lot of 
pending compactions?  maybe try to "chunk" your data into multiple 
columns or multiple rows.

i too see slowness that exhibits in the same manner as you guys have 
described.  i'm still trying to track it down as well.
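
a sketch of the knob, assuming the 0.6 storage-conf.xml where it is a global
setting (the values here are illustrations, not recommendations):

<!-- sketch: with ~1MB values a 64MB memtable flushes after only a few
     tens of thousands of columns; raising it means fewer, larger sstables -->
<MemtableThroughputInMB>256</MemtableThroughputInMB>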

On 05/06/2010 10:56 AM, Ran Tavory wrote:
> Jonathan, I think it's the case of large values in the columns. The 
> problematic CF is a key-value store, so it has only one column per 
> row, however the value of that column can be large. It's a java 
> serialized object (uncompressed) which, may be 100s of bytes, maybe 
> even a few megs. This CF also suffers from zero cache hits since each 
> time a read is for a unique key.
>
> I ran stress.py and I see much better results (reads are < 1ms) so I 
> assume my cluster is healthy, so I need to fix the app. Would 1meg 
> bytes object explain a 30ms (sometimes even more) read latency? The 
> boxes aren't fancy, not sure exactly what hardware we have there but 
> it's "commodity"...
>
> Thanks!
>
> On Thu, May 6, 2010 at 5:22 PM, Jonathan Ellis <jbellis@gmail.com 
> <ma...@gmail.com>> wrote:
>
>     columns, not CFs.
>
>     put another way, how wide are the rows in the slow CF?
>
>     On Wed, May 5, 2010 at 11:30 PM, Ran Tavory <rantav@gmail.com
>     <ma...@gmail.com>> wrote:
>     > I have a few CFs but the one I'm seeing slowness in, which is
>     the one with
>     > plenty of cache misses has only one column per key.
>     > Latency varies b/w 10ms and 60ms but I'd say average is 30ms.
>     >
>     > On Thu, May 6, 2010 at 4:25 AM, Jonathan Ellis
>     <jbellis@gmail.com <ma...@gmail.com>> wrote:
>     >>
>     >> How many columns are in the rows you are reading from?
>     >>
>     >> 30ms is quite high, so I suspect you have relatively large rows, in
>     >> which case decreasing the column index threshold may help.
>
>     --
>     Jonathan Ellis
>     Project Chair, Apache Cassandra
>     co-founder of Riptano, the source for professional Cassandra support
>     http://riptano.com
>
>


Re: performance tuning - where does the slowness come from?

Posted by Jonathan Ellis <jb...@gmail.com>.
Yes, that makes sense.  If you never have a warm cache then it's
probably disk seek time creating that latency, in which case there
isn't a whole lot you can do about it short of adding more capacity
(so at least it's cached at the OS level).

iostat -x could substantiate this guess.
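
Something along these lines (a sketch; exact column names depend on the
sysstat version):

$ iostat -x 5
# watch the data disk's await (average ms per request) and %util; sustained
# %util near 100% with double-digit await points at seek-bound reads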

On Thu, May 6, 2010 at 12:56 PM, Ran Tavory <ra...@gmail.com> wrote:
> Jonathan, I think it's the case of large values in the columns. The
> problematic CF is a key-value store, so it has only one column per row,
> however the value of that column can be large. It's a java serialized object
> (uncompressed) which, may be 100s of bytes, maybe even a few megs. This CF
> also suffers from zero cache hits since each time a read is for a unique
> key.
> I ran stress.py and I see much better results (reads are < 1ms) so I assume
> my cluster is healthy, so I need to fix the app. Would 1meg bytes object
> explain a 30ms (sometimes even more) read latency? The boxes aren't fancy,
> not sure exactly what hardware we have there but it's "commodity"...
> Thanks!
>
> On Thu, May 6, 2010 at 5:22 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> columns, not CFs.
>>
>> put another way, how wide are the rows in the slow CF?
>>
>> On Wed, May 5, 2010 at 11:30 PM, Ran Tavory <ra...@gmail.com> wrote:
>> > I have a few CFs but the one I'm seeing slowness in, which is the one
>> > with
>> > plenty of cache misses has only one column per key.
>> > Latency varies b/w 10ms and 60ms but I'd say average is 30ms.
>> >
>> > On Thu, May 6, 2010 at 4:25 AM, Jonathan Ellis <jb...@gmail.com>
>> > wrote:
>> >>
>> >> How many columns are in the rows you are reading from?
>> >>
>> >> 30ms is quite high, so I suspect you have relatively large rows, in
>> >> which case decreasing the column index threshold may help.
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: performance tuning - where does the slowness come from?

Posted by Ran Tavory <ra...@gmail.com>.
Jonathan, I think it's the case of large values in the columns. The
problematic CF is a key-value store, so it has only one column per row,
however the value of that column can be large. It's a java serialized object
(uncompressed) which may be 100s of bytes, maybe even a few megs. This CF
also suffers from zero cache hits since every read is for a unique key.

I ran stress.py and I see much better results (reads are < 1ms), so I assume
my cluster is healthy and I need to fix the app. Would a 1 MB object
explain a 30ms (sometimes even more) read latency? The boxes aren't fancy,
not sure exactly what hardware we have there but it's "commodity"...
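
A rough back-of-envelope says it can: a cold read costs a couple of disk seeks
plus the transfer of the value. A minimal sketch, where every constant is an
assumption for commodity SATA rather than a measurement:

# sketch: crude cold-read cost model (assumed numbers, not measurements)
SEEK_MS = 10.0            # one random seek on a commodity SATA disk
THROUGHPUT_MB_S = 60.0    # sustained sequential read rate

def cold_read_ms(value_mb, seeks=2):
    # roughly one seek for the SSTable index, one for the data, plus transfer
    return seeks * SEEK_MS + (value_mb / THROUGHPUT_MB_S) * 1000.0

print(cold_read_ms(0.0002))  # ~20 ms for a 200-byte value
print(cold_read_ms(1.0))     # ~37 ms for a 1 MB value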

Thanks!

On Thu, May 6, 2010 at 5:22 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> columns, not CFs.
>
> put another way, how wide are the rows in the slow CF?
>
> On Wed, May 5, 2010 at 11:30 PM, Ran Tavory <ra...@gmail.com> wrote:
> > I have a few CFs but the one I'm seeing slowness in, which is the one
> with
> > plenty of cache misses has only one column per key.
> > Latency varies b/w 10ms and 60ms but I'd say average is 30ms.
> >
> > On Thu, May 6, 2010 at 4:25 AM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> >>
> >> How many columns are in the rows you are reading from?
> >>
> >> 30ms is quite high, so I suspect you have relatively large rows, in
> >> which case decreasing the column index threshold may help.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>

Re: performance tuning - where does the slowness come from?

Posted by Jonathan Ellis <jb...@gmail.com>.
columns, not CFs.

put another way, how wide are the rows in the slow CF?
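
For reference, the "column index threshold" mentioned in the quote below is the
global ColumnIndexSizeInKB element in storage-conf.xml; a sketch, with the
value an assumption to tune rather than a recommendation (64 is, from memory,
the shipped default):

<!-- sketch: lower this for wide rows so a read seeks within the row
     instead of deserializing more of it -->
<ColumnIndexSizeInKB>16</ColumnIndexSizeInKB>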

On Wed, May 5, 2010 at 11:30 PM, Ran Tavory <ra...@gmail.com> wrote:
> I have a few CFs but the one I'm seeing slowness in, which is the one with
> plenty of cache misses has only one column per key.
> Latency varies b/w 10ms and 60ms but I'd say average is 30ms.
>
> On Thu, May 6, 2010 at 4:25 AM, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> How many columns are in the rows you are reading from?
>>
>> 30ms is quite high, so I suspect you have relatively large rows, in
>> which case decreasing the column index threshold may help.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: performance tuning - where does the slowness come from?

Posted by Ran Tavory <ra...@gmail.com>.
I have a few CFs, but the one I'm seeing slowness in, which is the one with
plenty of cache misses, has only one column per key.
Latency varies b/w 10ms and 60ms but I'd say the average is 30ms.

On Thu, May 6, 2010 at 4:25 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> How many columns are in the rows you are reading from?
>
> 30ms is quite high, so I suspect you have relatively large rows, in
> which case decreasing the column index threshold may help.
>
> On Wed, May 5, 2010 at 4:59 PM, Ran Tavory <ra...@gmail.com> wrote:
> > let's see if I can make some assertions, feel free to correct me...
> > Well, obviously, reads are much slower in cassandra than writes, everyone
> > knows that, but by which factor?
> > In my case I read/write only one column at a time. Key, column and value
> are
> > pretty small (< 200b)
> > So the numbers are usually - Write Latency: ~0.05ms, Read Latency: ~30ms.
> So
> > it looks like reads are 60x slower, at least on my hardware. This happens
> > when cache is cold. If cache is warm reads are better, but unfortunately
> my
> > cache is usually cold...
> > If my application keeps reading a cold cache there's nothing I can do to
> > make reads faster from cassandra's side. With one client thread this
> implies
> > 1000/30=33 reads/sec, not great.
> > However, although read latency is a bottleneck, read throughput isn't. So
> I
> > need to add more reading threads and I can actually add many of them
> before
> > read latency starts draining, according to ycsb I can have ~10000
> reads/sec
> > (on the cluster they tested, numbers may vary) before read latency starts
> > draining. So, numbers may vary by cluster size, hardware, data size etc,
> but
> > the idea is - if read latency is 30ms and cache is a miss most of the
> time,
> > that's normal, just add more reader threads and you get a better
> throughput.
> > Sorry if this sounds trivial, I was just trying to improve on the 30ms
> reads
> > until I realized I actually can't...

Re: performance tuning - where does the slowness come from?

Posted by Jonathan Ellis <jb...@gmail.com>.
How many columns are in the rows you are reading from?

30ms is quite high, so I suspect you have relatively large rows, in
which case decreasing the column index threshold may help.
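For reference, the column index threshold is the ColumnIndexSizeInKB setting in storage-conf.xml (a sketch, assuming the 0.6 element name and its usual default of 64; lowering it means a single-column read in a wide row deserializes a smaller slice of that row):

  <!-- storage-conf.xml: a column index is added to a row once it grows past
       this many KB; smaller values mean less data scanned per column lookup
       in wide rows, at the cost of a somewhat larger index. -->
  <ColumnIndexSizeInKB>16</ColumnIndexSizeInKB>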

On Wed, May 5, 2010 at 4:59 PM, Ran Tavory <ra...@gmail.com> wrote:
> let's see if I can make some assertions, feel free to correct me...
> Well, obviously, reads are much slower in cassandra than writes, everyone
> knows that, but by which factor?
> In my case I read/write only one column at a time. Key, column and value are
> pretty small (< 200b)
> So the numbers are usually - Write Latency: ~0.05ms, Read Latency: ~30ms. So
> it looks like reads are roughly 600x slower, at least on my hardware. This happens
> when cache is cold. If cache is warm reads are better, but unfortunately my
> cache is usually cold...
> If my application keeps reading a cold cache there's nothing I can do to
> make reads faster from cassandra's side. With one client thread this implies
> 1000/30=33 reads/sec, not great.
> However, although read latency is a bottleneck, read throughput isn't. So I
> need to add more reading threads and I can actually add many of them before
> read latency starts degrading; according to YCSB I can have ~10000 reads/sec
> (on the cluster they tested, numbers may vary) before read latency starts
> degrading. So, numbers may vary by cluster size, hardware, data size etc., but
> the idea is - if read latency is 30ms and cache is a miss most of the time,
> that's normal, just add more reader threads and you get a better throughput.
> Sorry if this sounds trivial, I was just trying to improve on the 30ms reads
> until I realized I actually can't...
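As a back-of-the-envelope illustration of the reader-thread arithmetic above - a sketch only, not tied to any particular client; readOneColumn() is a hypothetical stand-in for whatever Thrift/Hector call is used, with Thread.sleep(30) simulating the ~30ms cold-cache read:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ReadThroughputSketch {
    // Hypothetical stand-in for a single-key, single-column read through
    // whatever client is in use; ~30ms simulates the cold-cache latency above.
    static String readOneColumn(String key) throws InterruptedException {
        Thread.sleep(30);
        return "value-for-" + key;
    }

    public static void main(String[] args) throws Exception {
        final int threads = 32;   // scale up until read latency starts degrading
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        final AtomicLong done = new AtomicLong();
        long start = System.currentTimeMillis();
        for (int i = 0; i < 5000; i++) {
            final String key = "key-" + i;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        readOneColumn(key);
                        done.incrementAndGet();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        double secs = (System.currentTimeMillis() - start) / 1000.0;
        // Expect roughly threads * (1000/30) reads/sec, i.e. ~1000/sec with 32 threads.
        System.out.printf("%d reads in %.1fs = %.0f reads/sec%n",
                done.get(), secs, done.get() / secs);
    }
}

Past some point the extra client threads just queue up behind the server's read stage, so throughput flattens out rather than growing linearly.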
> On Wed, May 5, 2010 at 7:08 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>>  - your key cache isn't warm.  capacity 17M, size 0.5M, 468083 reads
>> sounds like most of your reads have been for unique keys.
>>  - the kind of reads you are doing can have a big effect (mostly
>> number of columns you are asking for).  column index granularity plays
>> a role (for non-rowcached reads); so can column comparator (see e.g.
>> https://issues.apache.org/jira/browse/CASSANDRA-1043)
>>  - the slow system reads are all on HH rows, which can get very wide
>> (hence, slow to read the whole row, which is what the HH code does).
>> clean those out either by bringing back the nodes it's hinting for, or
>> just removing the HH data files.
>>
>> On Wed, May 5, 2010 at 10:19 AM, Ran Tavory <ra...@gmail.com> wrote:
>> > I'm still trying to figure out where my slowness is coming from...
>> > By now I'm pretty sure it's the reads that are slow, but not sure how to
>> > improve
>> > them.
>> > I'm looking at cfstats. Can you say if there are better configuration
>> > options? So far I've used all default settings, except for:
>> >     <Keyspace Name="outbrain_kvdb">
>> >       <ColumnFamily CompareWith="BytesType" Name="KvImpressions"
>> > KeysCached="50%"/>
>> >
>> >
>> >  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackAwareStrategy</ReplicaPlacementStrategy>
>> >       <ReplicationFactor>2</ReplicationFactor>
>> >
>> >
>> >  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
>> >     </Keyspace>
>> >
>> > What does a good read latency look like? I was expecting 10ms, however
>> > so
>> > far it seems that my KvImpressions read latency is 30ms and in the
>> > system
>> > keyspace I have 800ms :(
>> > I thought adding KeysCached="50%" would improve my situation but
>> > unfortunately looks like the hitrate is about 0. I realize that's
>> > application specific, but maybe there are other magic bullets...
>> > Is there something like adding cache to the system keyspace? 800 ms is
>> > pretty bad, isn't it?
>> > See stats below and thanks.
>> >
>> > Keyspace: outbrain_kvdb
>> >         Read Count: 651668
>> >         Read Latency: 34.18622328547666 ms.
>> >         Write Count: 655542
>> >         Write Latency: 0.041145092152752985 ms.
>> >         Pending Tasks: 0
>> >                 Column Family: KvImpressions
>> >                 SSTable count: 13
>> >                 Space used (live): 23304548897
>> >                 Space used (total): 23304548897
>> >                 Memtable Columns Count: 895
>> >                 Memtable Data Size: 2108990
>> >                 Memtable Switch Count: 8
>> >                 Read Count: 468083
>> >                 Read Latency: 151.603 ms.
>> >                 Write Count: 552566
>> >                 Write Latency: 0.023 ms.
>> >                 Pending Tasks: 0
>> >                 Key cache capacity: 17398656
>> >                 Key cache size: 567967
>> >                 Key cache hit rate: 0.0
>> >                 Row cache: disabled
>> >                 Compacted row minimum size: 269
>> >                 Compacted row maximum size: 54501
>> >                 Compacted row mean size: 933
>> > ...
>> > ----------------
>> > Keyspace: system
>> >         Read Count: 1151
>> >         Read Latency: 872.5014448305822 ms.
>> >         Write Count: 51215
>> >         Write Latency: 0.07156788050375866 ms.
>> >         Pending Tasks: 0
>> >                 Column Family: HintsColumnFamily
>> >                 SSTable count: 5
>> >                 Space used (live): 437366878
>> >                 Space used (total): 437366878
>> >                 Memtable Columns Count: 14987
>> >                 Memtable Data Size: 87975
>> >                 Memtable Switch Count: 2
>> >                 Read Count: 1150
>> >                 Read Latency: NaN ms.
>> >                 Write Count: 51211
>> >                 Write Latency: 0.027 ms.
>> >                 Pending Tasks: 0
>> >                 Key cache capacity: 6
>> >                 Key cache size: 4
>> >                 Key cache hit rate: NaN
>> >                 Row cache: disabled
>> >                 Compacted row minimum size: 0
>> >                 Compacted row maximum size: 0
>> >                 Compacted row mean size: 0
>> >                 Column Family: LocationInfo
>> >                 SSTable count: 2
>> >                 Space used (live): 3504
>> >                 Space used (total): 3504
>> >                 Memtable Columns Count: 0
>> >                 Memtable Data Size: 0
>> >                 Memtable Switch Count: 1
>> >                 Read Count: 1
>> >                 Read Latency: NaN ms.
>> >                 Write Count: 7
>> >                 Write Latency: NaN ms.
>> >                 Pending Tasks: 0
>> >                 Key cache capacity: 2
>> >                 Key cache size: 1
>> >                 Key cache hit rate: NaN
>> >                 Row cache: disabled
>> >                 Compacted row minimum size: 0
>> >                 Compacted row maximum size: 0
>> >                 Compacted row mean size: 0
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: performance tuning - where does the slowness come from?

Posted by Jonathan Ellis <jb...@gmail.com>.
Nope, that's exactly how it works in Java too.

On Thu, May 6, 2010 at 9:21 AM, Vick Khera <vi...@khera.org> wrote:
> On Wed, May 5, 2010 at 8:08 PM, Kyusik Chung <ky...@discovereads.com> wrote:
>> if the data from the sstables hasn't already been loaded into memory by mmap,
>> load it into memory; if you're out of memory on the box, swap some of the
>> old mmapped data out of memory
>
> mmap() does not copy your data into memory; it maps your virtual
> address space to the disk file, so you can treat the file as memory,
> usually an array of bytes or structures.  Or is mmap on Java different
> than that in C?
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: performance tuning - where does the slowness come from?

Posted by Vick Khera <vi...@khera.org>.
On Wed, May 5, 2010 at 8:08 PM, Kyusik Chung <ky...@discovereads.com> wrote:
> if the data from the sstables hasn't already been loaded into memory by mmap,
> load it into memory; if you're out of memory on the box, swap some of the
> old mmapped data out of memory

mmap() does not copy your data into memory; it maps your virtual
address space to the disk file, so you can treat the file as memory,
usually an array of bytes or structures.  Or is mmap on Java different
than that in C?
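A tiny Java illustration of that point (plain NIO, not Cassandra code; a sketch only, assuming a non-empty file path as the argument): mapping reserves virtual address space, and the kernel pages data in from disk only as it is touched.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapSketch {
    // usage: java MmapSketch <path-to-large-file>
    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile(args[0], "r");
        FileChannel ch = raf.getChannel();
        // A single Java mapping is capped at 2GB, so map at most that much here.
        long size = Math.min(ch.size(), Integer.MAX_VALUE);
        // map() only sets up the virtual-memory mapping; nothing is copied yet,
        // which is why VIRT can dwarf RES for an mmap-heavy process.
        MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
        // Touching bytes makes the kernel fault those pages of the file into the
        // page cache (possibly evicting other pages if physical memory is tight).
        byte first = buf.get(0);
        byte last = buf.get((int) (size - 1));
        System.out.println("first=" + first + " last=" + last);
        raf.close();
    }
}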

Re: performance tuning - where does the slowness come from?

Posted by Kyusik Chung <ky...@discovereads.com>.
I've been running some tests, realizing that I have much to learn about Cassandra, and here's where I've gotten to:

I no longer believe mmap is a performance issue in certain scenarios.  If I'm understanding how mmap works, then this is what is happening:

1. Request a read on a key and column.
2. If the data from the sstables hasn't already been loaded into memory by mmap, load it into memory; if you're out of memory on the box, swap some of the old mmapped data out of memory.
3. Read the data and return the value to the client.

The issue I believe I was running into earlier when using mmap was that the amount of data that was trying to be pulled into memory for reads was far larger than the physical memory on the box, and I was simulating completely random key accesses.  Therefore, once I had used up all the memory on the box, mmap was constantly swapping data out of memory, causing a major slowdown.  Fortunately, completely random key access is not a realistic access pattern for us - most likely, we will have time periods of repeated accesses of a small percentage of our keys.  I've tested that with this kind of access, mmap is a nice way to keep that data cached in memory.
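For what it's worth, the access mode under discussion is set in storage-conf.xml (a sketch, assuming the 0.6 option names); mmap_index_only maps only the SSTable index files, which bounds the mapped footprint when the data set is much larger than RAM:

  <!-- storage-conf.xml: auto | mmap | mmap_index_only | standard.
       standard uses plain buffered reads; mmap_index_only maps just the
       SSTable index files rather than the data files. -->
  <DiskAccessMode>mmap_index_only</DiskAccessMode>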

In terms of read latency... I'm getting values roughly similar to Ran: ~30ms to read a single column from a single row.  We've got about 50MM (keys * columns) in our test dataset.

I've tried row-based caching, and it's not clear that I get much better perf than mmap, even when the cache hit rate is well over 80%.

I'm using 2x the number of CPU cores for concurrent reads... haven't tried increasing that to a large value like Ran suggests... may try that soon.
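The 2x-cores figure maps to the read-stage sizing in storage-conf.xml (a sketch, assuming the 0.6 element names; on an 8-core box, 2x cores would be 16):

  <!-- storage-conf.xml: threads servicing the read and write stages
       (the ROW-READ-STAGE pool from the JMX output earlier in the thread). -->
  <ConcurrentReads>16</ConcurrentReads>
  <ConcurrentWrites>32</ConcurrentWrites>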

Kyusik Chung

On May 5, 2010, at 4:36 PM, Mark Jones wrote:

> Have you actually managed to get 10K reads/second, or are you just estimating that you can? I've run into similar issues, but I never got reads to scale when searching for unique keys even using 40 threads. I did discover that using 80+ threads I can actually reduce performance. I've never gotten more than 200-300 reads/second (steady state) off a 4-node cluster. I can get roughly 8K writes/second to the same cluster (although I haven't tested both simultaneously with results worth talking about).
>  
> From: Ran Tavory [mailto:rantav@gmail.com] 
> Sent: Wednesday, May 05, 2010 4:59 PM
> To: user@cassandra.apache.org
> Subject: Re: performance tuning - where does the slowness come from?
>  
> let's see if I can make some assertions, feel free to correct me...
>  
> Well, obviously, reads are much slower in cassandra than writes, everyone knows that, but by which factor?
> In my case I read/write only one column at a time. Key, column and value are pretty small (< 200b)
> So the numbers are usually - Write Latency: ~0.05ms, Read Latency: ~30ms. So it looks like reads are roughly 600x slower, at least on my hardware. This happens when cache is cold. If cache is warm reads are better, but unfortunately my cache is usually cold...
> If my application keeps reading a cold cache there's nothing I can do to make reads faster from cassandra's side. With one client thread this implies 1000/30=33 reads/sec, not great.
> However, although read latency is a bottleneck, read throughput isn't. So I need to add more reading threads and I can actually add many of them before read latency starts degrading; according to YCSB I can have ~10000 reads/sec (on the cluster they tested, numbers may vary) before read latency starts degrading. So, numbers may vary by cluster size, hardware, data size etc., but the idea is - if read latency is 30ms and cache is a miss most of the time, that's normal, just add more reader threads and you get a better throughput.
> Sorry if this sounds trivial, I was just trying to improve on the 30ms reads until I realized I actually can't... 
>  
> On Wed, May 5, 2010 at 7:08 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>  - your key cache isn't warm.  capacity 17M, size 0.5M, 468083 reads
> sounds like most of your reads have been for unique keys.
>  - the kind of reads you are doing can have a big effect (mostly
> number of columns you are asking for).  column index granularity plays
> a role (for non-rowcached reads); so can column comparator (see e.g.
> https://issues.apache.org/jira/browse/CASSANDRA-1043)
>  - the slow system reads are all on HH rows, which can get very wide
> (hence, slow to read the whole row, which is what the HH code does).
> clean those out either by bringing back the nodes it's hinting for, or
> just removing the HH data files.
> 
> On Wed, May 5, 2010 at 10:19 AM, Ran Tavory <ra...@gmail.com> wrote:
> > I'm still trying to figure out where my slowness is coming from...
> > By now I'm pretty sure it's the reads are slow, but not sure how to improve
> > them.
> > I'm looking at cfstats. Can you say if there are better configuration
> > options? So far I've used all default settings, except for:
> >     <Keyspace Name="outbrain_kvdb">
> >       <ColumnFamily CompareWith="BytesType" Name="KvImpressions"
> > KeysCached="50%"/>
> >
> >  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackAwareStrategy</ReplicaPlacementStrategy>
> >       <ReplicationFactor>2</ReplicationFactor>
> >
> >  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
> >     </Keyspace>
> >
> > What does a good read latency look like? I was expecting 10ms, however so
> > far it seems that my KvImpressions read latency is 30ms and in the system
> > keyspace I have 800ms :(
> > I thought adding KeysCached="50%" would improve my situation but
> > unfortunately looks like the hitrate is about 0. I realize that's
> > application specific, but maybe there are other magic bullets...
> > Is there something like adding cache to the system keyspace? 800 ms is
> > pretty bad, isn't it?
> > See stats below and thanks.
> >
> > Keyspace: outbrain_kvdb
> >         Read Count: 651668
> >         Read Latency: 34.18622328547666 ms.
> >         Write Count: 655542
> >         Write Latency: 0.041145092152752985 ms.
> >         Pending Tasks: 0
> >                 Column Family: KvImpressions
> >                 SSTable count: 13
> >                 Space used (live): 23304548897
> >                 Space used (total): 23304548897
> >                 Memtable Columns Count: 895
> >                 Memtable Data Size: 2108990
> >                 Memtable Switch Count: 8
> >                 Read Count: 468083
> >                 Read Latency: 151.603 ms.
> >                 Write Count: 552566
> >                 Write Latency: 0.023 ms.
> >                 Pending Tasks: 0
> >                 Key cache capacity: 17398656
> >                 Key cache size: 567967
> >                 Key cache hit rate: 0.0
> >                 Row cache: disabled
> >                 Compacted row minimum size: 269
> >                 Compacted row maximum size: 54501
> >                 Compacted row mean size: 933
> > ...
> > ----------------
> > Keyspace: system
> >         Read Count: 1151
> >         Read Latency: 872.5014448305822 ms.
> >         Write Count: 51215
> >         Write Latency: 0.07156788050375866 ms.
> >         Pending Tasks: 0
> >                 Column Family: HintsColumnFamily
> >                 SSTable count: 5
> >                 Space used (live): 437366878
> >                 Space used (total): 437366878
> >                 Memtable Columns Count: 14987
> >                 Memtable Data Size: 87975
> >                 Memtable Switch Count: 2
> >                 Read Count: 1150
> >                 Read Latency: NaN ms.
> >                 Write Count: 51211
> >                 Write Latency: 0.027 ms.
> >                 Pending Tasks: 0
> >                 Key cache capacity: 6
> >                 Key cache size: 4
> >                 Key cache hit rate: NaN
> >                 Row cache: disabled
> >                 Compacted row minimum size: 0
> >                 Compacted row maximum size: 0
> >                 Compacted row mean size: 0
> >                 Column Family: LocationInfo
> >                 SSTable count: 2
> >                 Space used (live): 3504
> >                 Space used (total): 3504
> >                 Memtable Columns Count: 0
> >                 Memtable Data Size: 0
> >                 Memtable Switch Count: 1
> >                 Read Count: 1
> >                 Read Latency: NaN ms.
> >                 Write Count: 7
> >                 Write Latency: NaN ms.
> >                 Pending Tasks: 0
> >                 Key cache capacity: 2
> >                 Key cache size: 1
> >                 Key cache hit rate: NaN
> >                 Row cache: disabled
> >                 Compacted row minimum size: 0
> >                 Compacted row maximum size: 0
> >                 Compacted row mean size: 0
> >
> > On Tue, May 4, 2010 at 10:57 PM, Kyusik Chung <ky...@discovereads.com>
> > wrote:
> >>
> >> Im using Ubuntu 8.04 on 64 bit hosts on rackspace cloud.
> >>
> >> Im in the middle of repeating some perf tests, but so far, I get as-good
> >> or slightly better read perf by using standard disk access mode vs mmap.  So
> >> far consecutive tests are returning consistent numbers.
> >>
> >> Im not sure how to explain it...maybe its an ubuntu 8.04 issue with mmap.
> >>  Back when I was using mmap, I was definitely seeing the kswapd0 process
> >> start using cpu as the box ran out of memory, and read performance
> >> significantly degraded.
> >>
> >> Next, Ill run some tests with mmap_index_only, and Ill test with heavy
> >> concurrent writes as well as reads.  Ill let everyone know what I find.
> >>
> >> Kyusik Chung
> >> CEO, Discovereads.com
> >> kyusik@discovereads.com
> >>
> >> On May 4, 2010, at 12:27 PM, Jonathan Ellis wrote:
> >>
> >> > Are you using 32 bit hosts?  If not don't be scared of mmap using a
> >> > lot of address space, you have plenty.  It won't make you swap more
> >> > than using buffered i/o.
> >> >
> >> > On Tue, May 4, 2010 at 1:57 PM, Ran Tavory <ra...@gmail.com> wrote:
> >> >> I canceled mmap and indeed memory usage is sane again. So far
> >> >> performance
> >> >> hasn't been great, but I'll wait and see.
> >> >> I'm also interested in a way to cap mmap so I can take advantage of it
> >> >> but
> >> >> not swap the host to death...
> >> >>
> >> >> On Tue, May 4, 2010 at 9:38 PM, Kyusik Chung <ky...@discovereads.com>
> >> >> wrote:
> >> >>>
> >> >>> This sounds just like the slowness I was asking about in another
> >> >>> thread -
> >> >>> after a lot of reads, the machine uses up all available memory on the
> >> >>> box
> >> >>> and then starts swapping.
> >> >>> My understanding was that mmap helps greatly with read and write perf
> >> >>> (until the box starts swapping I guess)...is there any way to use mmap
> >> >>> and
> >> >>> cap how much memory it takes up?
> >> >>> What do people use in production?  mmap or no mmap?
> >> >>> Thanks!
> >> >>> Kyusik Chung
> >> >>> On May 4, 2010, at 10:11 AM, Schubert Zhang wrote:
> >> >>>
> >> >>> 1. When initially startup your nodes, please plan your InitialToken of
> >> >>> each node evenly.
> >> >>> 2. <DiskAccessMode>standard</DiskAccessMode>
> >> >>>
> >> >>> On Tue, May 4, 2010 at 9:09 PM, Boris Shulman <sh...@gmail.com>
> >> >>> wrote:
> >> >>>>
> >> >>>> I think that the extra (more than 4GB) memory usage comes from the
> >> >>>> mmaped io, that is why it happens only for reads.
> >> >>>>
> >> >>>> On Tue, May 4, 2010 at 2:02 PM, Jordan Pittier
> >> >>>> <jo...@gmail.com>
> >> >>>> wrote:
> >> >>>>> I'm facing the same issue with swap. It only occurs when I perform
> >> >>>>> read
> >> >>>>> operations (write are very fast :)). So I can't help you with the
> >> >>>>> memory
> >> >>>>> probleme.
> >> >>>>>
> >> >>>>> But to balance the load evenly between nodes in cluster just
> >> >>>>> manually
> >> >>>>> fix
> >> >>>>> their token.(the "formula" is i * 2^127 / nb_nodes).
> >> >>>>>
> >> >>>>> Jordzn
> >> >>>>>
> >> >>>>> On Tue, May 4, 2010 at 8:20 AM, Ran Tavory <ra...@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>> I'm looking into performance issues on a 0.6.1 cluster. I see two
> >> >>>>>> symptoms:
> >> >>>>>> 1. Reads and writes are slow
> >> >>>>>> 2. One of the hosts is doing a lot of GC.
> >> >>>>>> 1 is slow in the sense that in normal state the cluster used to
> >> >>>>>> make
> >> >>>>>> around 3-5k read and writes per second (6-10k operations per
> >> >>>>>> second),
> >> >>>>>> but
> >> >>>>>> how it's in the order of 200-400 ops per second, sometimes even
> >> >>>>>> less.
> >> >>>>>> 2 looks like this:
> >> >>>>>> $ tail -f /outbrain/cassandra/log/system.log
> >> >>>>>>  INFO [GC inspection] 2010-05-04 00:42:18,636 GCInspector.java
> >> >>>>>> (line
> >> >>>>>> 110)
> >> >>>>>> GC for ParNew: 672 ms, 166482384 reclaimed leaving 2872087208 used;
> >> >>>>>> max is
> >> >>>>>> 4432068608
> >> >>>>>>  INFO [GC inspection] 2010-05-04 00:42:28,638 GCInspector.java
> >> >>>>>> (line
> >> >>>>>> 110)
> >> >>>>>> GC for ParNew: 498 ms, 166493352 reclaimed leaving 2836049448 used;
> >> >>>>>> max is
> >> >>>>>> 4432068608
> >> >>>>>>  INFO [GC inspection] 2010-05-04 00:42:38,640 GCInspector.java
> >> >>>>>> (line
> >> >>>>>> 110)
> >> >>>>>> GC for ParNew: 327 ms, 166091528 reclaimed leaving 2796888424 used;
> >> >>>>>> max is
> >> >>>>>> 4432068608
> >> >>>>>> ... and it goes on and on for hours, no stopping...
> >> >>>>>> The cluster is made of 6 hosts, 3 in one DC and 3 in another.
> >> >>>>>> Each host has 8G RAM.
> >> >>>>>> -Xmx=4G
> >> >>>>>> For some reason, the load isn't distributed evenly b/w the hosts,
> >> >>>>>> although
> >> >>>>>> I'm not sure this is the cause for slowness
> >> >>>>>> $ nodetool -h localhost -p 9004 ring
> >> >>>>>> Address       Status     Load          Range
> >> >>>>>>        Ring
> >> >>>>>>
> >> >>>>>> 144413773383729447702215082383444206680
> >> >>>>>> 192.168.252.99Up         15.94 GB
> >> >>>>>>  66002764663998929243644931915471302076     |<--|
> >> >>>>>> 192.168.254.57Up         19.84 GB
> >> >>>>>>  81288739225600737067856268063987022738     |   ^
> >> >>>>>> 192.168.254.58Up         973.78 MB
> >> >>>>>> 86999744104066390588161689990810839743     v   |
> >> >>>>>> 192.168.252.62Up         5.18 GB
> >> >>>>>> 88308919879653155454332084719458267849     |   ^
> >> >>>>>> 192.168.254.59Up         10.57 GB
> >> >>>>>>  142482163220375328195837946953175033937    v   |
> >> >>>>>> 192.168.252.61Up         11.36 GB
> >> >>>>>>  144413773383729447702215082383444206680    |-->|
> >> >>>>>> The slow host is 192.168.252.61 and it isn't the most loaded one.
> >> >>>>>> The host is waiting a lot on IO and the load average is usually 6-7
> >> >>>>>> $ w
> >> >>>>>>  00:42:56 up 11 days, 13:22,  1 user,  load average: 6.21, 5.52,
> >> >>>>>> 3.93
> >> >>>>>> $ vmstat 5
> >> >>>>>> procs -----------memory---------- ---swap-- -----io---- --system--
> >> >>>>>> -----cpu------
> >> >>>>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs
> >> >>>>>> us
> >> >>>>>> sy id
> >> >>>>>> wa st
> >> >>>>>>  0  8 2147844  45744   1816 4457384    6    5    66    32    5    2
> >> >>>>>>  1
> >> >>>>>>  1
> >> >>>>>> 96  2  0
> >> >>>>>>  0  8 2147164  49020   1808 4451596  385    0  2345    58 3372 9957
> >> >>>>>>  2
> >> >>>>>>  2
> >> >>>>>> 78 18  0
> >> >>>>>>  0  3 2146432  45704   1812 4453956  342    0  2274   108 3937
> >> >>>>>> 10732
> >> >>>>>>  2  2
> >> >>>>>> 78 19  0
> >> >>>>>>  0  1 2146252  44696   1804 4453436  345  164  1939   294 3647 7833
> >> >>>>>>  2
> >> >>>>>>  2
> >> >>>>>> 78 18  0
> >> >>>>>>  0  1 2145960  46924   1744 4451260  158    0  2423   122 4354
> >> >>>>>> 14597
> >> >>>>>>  2  2
> >> >>>>>> 77 18  0
> >> >>>>>>  7  1 2138344  44676    952 4504148 1722  403  1722   406 1388  439
> >> >>>>>> 87
> >> >>>>>>  0
> >> >>>>>> 10  2  0
> >> >>>>>>  7  2 2137248  45652    956 4499436 1384  655  1384   658 1356  392
> >> >>>>>> 87
> >> >>>>>>  0
> >> >>>>>> 10  3  0
> >> >>>>>>  7  1 2135976  46764    956 4495020 1366  718  1366   718 1395  380
> >> >>>>>> 87
> >> >>>>>>  0
> >> >>>>>>  9  4  0
> >> >>>>>>  0  8 2134484  46964    956 4489420 1673  555  1814   586 1601
> >> >>>>>> 215590
> >> >>>>>> 14
> >> >>>>>>  2 68 16  0
> >> >>>>>>  0  1 2135388  47444    972 4488516  785  833  2390   995 3812 8305
> >> >>>>>>  2
> >> >>>>>>  2
> >> >>>>>> 77 20  0
> >> >>>>>>  0 10 2135164  45928    980 4488796  788  543  2275   626 36
> >> >>>>>> So, the host is swapping like crazy...
> >> >>>>>> top shows that it's using a lot of memory. As noted before -Xmx=4G
> >> >>>>>> and
> >> >>>>>> nothing else seems to be using a lot of memory on the host except
> >> >>>>>> for
> >> >>>>>> the
> >> >>>>>> cassandra process, however, of the 8G ram on the host, 92% is used
> >> >>>>>> by
> >> >>>>>> cassandra. How's that?
> >> >>>>>> Top shows there's 3.9g Shared and 7.2g Resident and 15.9g Virtual.
> >> >>>>>> Why
> >> >>>>>> does it have 15g virtual? And why 7.2 RES? This can explain the
> >> >>>>>> slowness in
> >> >>>>>> swapping.
> >> >>>>>> $ top
> >> >>>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
> >> >>>>>>  COMMAND
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> 20281 cassandr  25   0 15.9g 7.2g 3.9g S 33.3 92.6 175:30.27 java
> >> >>>>>> So, can the total memory be controlled?
> >> >>>>>> Or perhaps I'm looking in the wrong direction...
> >> >>>>>> I've looked at all the cassandra JMX counts and nothing seemed
> >> >>>>>> suspicious
> >> >>>>>> so far. By suspicious i mean a large number of pending tasks -
> >> >>>>>> there
> >> >>>>>> were
> >> >>>>>> always very small numbers in each pool.
> >> >>>>>> About read and write latencies, I'm not sure what the normal state
> >> >>>>>> is,
> >> >>>>>> but
> >> >>>>>> here's an example of what I see on the problematic host:
> >> >>>>>> #mbean = org.apache.cassandra.service:type=StorageProxy:
> >> >>>>>> RecentReadLatencyMicros = 30105.888180684495;
> >> >>>>>> TotalReadLatencyMicros = 78543052801;
> >> >>>>>> TotalWriteLatencyMicros = 4213118609;
> >> >>>>>> RecentWriteLatencyMicros = 1444.4809201925639;
> >> >>>>>> ReadOperations = 4779553;
> >> >>>>>> RangeOperations = 0;
> >> >>>>>> TotalRangeLatencyMicros = 0;
> >> >>>>>> RecentRangeLatencyMicros = NaN;
> >> >>>>>> WriteOperations = 4740093;
> >> >>>>>> And the only pool that I do see some pending tasks is the
> >> >>>>>> ROW-READ-STAGE,
> >> >>>>>> but it doesn't look like much, usually around 6-8:
> >> >>>>>> #mbean = org.apache.cassandra.concurrent:type=ROW-READ-STAGE:
> >> >>>>>> ActiveCount = 8;
> >> >>>>>> PendingTasks = 8;
> >> >>>>>> CompletedTasks = 5427955;
> >> >>>>>> Any help finding the solution is appreciated, thanks...
> >> >>>>>> Below are a few more JMXes I collected from the system that may be
> >> >>>>>> interesting.
> >> >>>>>> #mbean = java.lang:type=Memory:
> >> >>>>>> Verbose = false;
> >> >>>>>> HeapMemoryUsage = {
> >> >>>>>>   committed = 3767279616;
> >> >>>>>>   init = 134217728;
> >> >>>>>>   max = 4293656576;
> >> >>>>>>   used = 1237105080;
> >> >>>>>>  };
> >> >>>>>> NonHeapMemoryUsage = {
> >> >>>>>>   committed = 35061760;
> >> >>>>>>   init = 24313856;
> >> >>>>>>   max = 138412032;
> >> >>>>>>   used = 23151320;
> >> >>>>>>  };
> >> >>>>>> ObjectPendingFinalizationCount = 0;
> >> >>>>>> #mbean = java.lang:name=ParNew,type=GarbageCollector:
> >> >>>>>> LastGcInfo = {
> >> >>>>>>   GcThreadCount = 11;
> >> >>>>>>   duration = 136;
> >> >>>>>>   endTime = 42219272;
> >> >>>>>>   id = 11719;
> >> >>>>>>   memoryUsageAfterGc = {
> >> >>>>>>     ( CMS Perm Gen ) = {
> >> >>>>>>       key = CMS Perm Gen;
> >> >>>>>>       value = {
> >> >>>>>>         committed = 29229056;
> >> >>>>>>         init = 21757952;
> >> >>>>>>         max = 88080384;
> >> >>>>>>         used = 17648848;
> >> >>>>>>        };
> >> >>>>>>      };
> >> >>>>>>     ( Code Cache ) = {
> >> >>>>>>       key = Code Cache;
> >> >>>>>>       value = {
> >> >>>>>>         committed = 5832704;
> >> >>>>>>         init = 2555904;
> >> >>>>>>         max = 50331648;
> >> >>>>>>         used = 5563520;
> >> >>>>>>        };
> >> >>>>>>      };
> >> >>>>>>     ( CMS Old Gen ) = {
> >> >>>>>>       key = CMS Old Gen;
> >> >>>>>>       value = {
> >> >>>>>>         committed = 3594133504;
> >> >>>>>>         init = 112459776;
> >> >>>>>>         max = 4120510464;
> >> >>>>>>         used = 964565720;
> >> >>>>>>        };
> >> >>>>>>      };
> >> >>>>>>     ( Par Eden Space ) = {
> >> >>>>>>       key = Par Eden Space;
> >> >>>>>>       value = {
> >> >>>>>>         committed = 171835392;
> >> >>>>>>         init = 21495808;
> >> >>>>>>         max = 171835392;
> >> >>>>>>         used = 0;
> >> >>>>>>        };
> >> >>>>>>      };
> >> >>>>>>     ( Par Survivor Space ) = {
> >> >>>>>>       key = Par Survivor Space;
> >> >>>>>>       value = {
> >> >>>>>>         committed = 1310720;
> >> >>>>>>         init = 131072;
> >> >>>>>>         max = 1310720;
> >> >>>>>>         used = 0;
> >> >>>>>>        };
> >> >>>>>>      };
> >> >>>>>>    };
> >> >>>>>>   memoryUsageBeforeGc = {
> >> >>>>>>     ( CMS Perm Gen ) = {
> >> >>>>>>       key = CMS Perm Gen;
> >> >>>>>>       value = {
> >> >>>>>>         committed = 29229056;
> >> >>>>>>         init = 21757952;
> >> >>>>>>         max = 88080384;
> >> >>>>>>         used = 17648848;
> >> >>>>>>        };
> >> >>>>>>      };
> >> >>>>>>     ( Code Cache ) = {
> >> >>>>>>       key = Code Cache;
> >> >>>>>>       value = {
> >> >>>>>>         committed = 5832704;
> >> >>>>>>         init = 2555904;
> >> >>>>>>         max = 50331648;
> >> >>>>>>         used = 5563520;
> >> >>>>>>        };
> >> >>>>>>      };
> >> >>>>>>     ( CMS Old Gen ) = {
> >> >>>>>>       key = CMS Old Gen;
> >> >>>>>>       value = {
> >> >>>>>>         committed = 3594133504;
> >> >>>>>>         init = 112459776;
> >> >>>>>>         max = 4120510464;
> >> >>>>>>         used = 959221872;
> >> >>>>>>        };
> >> >>>>>>      };
> >> >>>>>>     ( Par Eden Space ) = {
> >> >>>>>>       key = Par Eden Space;
> >> >>>>>>       value = {
> >> >>>>>>         committed = 171835392;
> >> >>>>>>         init = 21495808;
> >> >>>>>>         max = 171835392;
> >> >>>>>>         used = 171835392;
> >> >>>>>>        };
> >> >>>>>>      };
> >> >>>>>>     ( Par Survivor Space ) = {
> >> >>>>>>       key = Par Survivor Space;
> >> >>>>>>       value = {
> >> >>>>>>         committed = 1310720;
> >> >>>>>>         init = 131072;
> >> >>>>>>         max = 1310720;
> >> >>>>>>         used = 0;
> >> >>>>>>        };
> >> >>>>>>      };
> >> >>>>>>    };
> >> >>>>>>   startTime = 42219136;
> >> >>>>>>  };
> >> >>>>>> CollectionCount = 11720;
> >> >>>>>> CollectionTime = 4561730;
> >> >>>>>> Name = ParNew;
> >> >>>>>> Valid = true;
> >> >>>>>> MemoryPoolNames = [ Par Eden Space, Par Survivor Space ];
> >> >>>>>> #mbean = java.lang:type=OperatingSystem:
> >> >>>>>> MaxFileDescriptorCount = 63536;
> >> >>>>>> OpenFileDescriptorCount = 75;
> >> >>>>>> CommittedVirtualMemorySize = 17787711488;
> >> >>>>>> FreePhysicalMemorySize = 45522944;
> >> >>>>>> FreeSwapSpaceSize = 2123968512;
> >> >>>>>> ProcessCpuTime = 12251460000000;
> >> >>>>>> TotalPhysicalMemorySize = 8364417024;
> >> >>>>>> TotalSwapSpaceSize = 4294959104;
> >> >>>>>> Name = Linux;
> >> >>>>>> AvailableProcessors = 8;
> >> >>>>>> Arch = amd64;
> >> >>>>>> SystemLoadAverage = 4.36;
> >> >>>>>> Version = 2.6.18-164.15.1.el5;
> >> >>>>>> #mbean = java.lang:type=Runtime:
> >> >>>>>> Name = 20281@ob1061.nydc1.outbrain.com;
> >> >>>>>>
> >> >>>>>> ClassPath =
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> /outbrain/cassandra/apache-cassandra-0.6.1/bin/../conf:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../build/classes:/outbrain/cassandra/apache-cassandra-0.6.1/bin/..
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> /lib/antlr-3.1.3.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/apache-cassandra-0.6.1.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/avro-1.2.0-dev.jar:/outb
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> rain/cassandra/apache-cassandra-0.6.1/bin/../lib/clhm-production.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/commons-cli-1.1.jar:/outbrain/cassandra/apache-cassandra-
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> 0.6.1/bin/../lib/commons-codec-1.2.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/commons-collections-3.2.1.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/com
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> mons-lang-2.4.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/google-collections-1.0.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/hadoop-core-0.20.1.jar:/out
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> brain/cassandra/apache-cassandra-0.6.1/bin/../lib/high-scale-lib.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/ivy-2.1.0.jar:/outbrain/cassandra/apache-cassandra-0.6.1/
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> bin/../lib/jackson-core-asl-1.4.0.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/jackson-mapper-asl-1.4.0.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/jline
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> -0.9.94.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/json-simple-1.1.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/libthrift-r917130.jar:/outbrain/cassandr
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> a/apache-cassandra-0.6.1/bin/../lib/log4j-1.2.14.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/slf4j-api-1.5.8.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib
> >> >>>>>> /slf4j-log4j12-1.5.8.jar;
> >> >>>>>>
> >> >>>>>> BootClassPath =
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> /usr/java/jdk1.6.0_17/jre/lib/alt-rt.jar:/usr/java/jdk1.6.0_17/jre/lib/resources.jar:/usr/java/jdk1.6.0_17/jre/lib/rt.jar:/usr/java/jdk1.6.0_17/jre/lib/sunrsasign.j
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> ar:/usr/java/jdk1.6.0_17/jre/lib/jsse.jar:/usr/java/jdk1.6.0_17/jre/lib/jce.jar:/usr/java/jdk1.6.0_17/jre/lib/charsets.jar:/usr/java/jdk1.6.0_17/jre/classes;
> >> >>>>>>
> >> >>>>>> LibraryPath =
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> /usr/java/jdk1.6.0_17/jre/lib/amd64/server:/usr/java/jdk1.6.0_17/jre/lib/amd64:/usr/java/jdk1.6.0_17/jre/../lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib;
> >> >>>>>>
> >> >>>>>> VmName = Java HotSpot(TM) 64-Bit Server VM;
> >> >>>>>>
> >> >>>>>> VmVendor = Sun Microsystems Inc.;
> >> >>>>>>
> >> >>>>>> VmVersion = 14.3-b01;
> >> >>>>>>
> >> >>>>>> BootClassPathSupported = true;
> >> >>>>>>
> >> >>>>>> InputArguments = [ -ea, -Xms128M, -Xmx4G,
> >> >>>>>> -XX:TargetSurvivorRatio=90,
> >> >>>>>> -XX:+AggressiveOpts, -XX:+UseParNewGC, -XX:+UseConcMarkSweepGC,
> >> >>>>>> -XX:+CMSParallelRemarkEnabled, -XX:+HeapDumpOnOutOfMemoryError,
> >> >>>>>> -XX:SurvivorRatio=128, -XX:MaxTenuringThreshold=0,
> >> >>>>>> -Dcom.sun.management.jmxremote.port=9004,
> >> >>>>>> -Dcom.sun.management.jmxremote.ssl=false,
> >> >>>>>> -Dcom.sun.management.jmxremote.authenticate=false,
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> -Dstorage-config=/outbrain/cassandra/apache-cassandra-0.6.1/bin/../conf,
> >> >>>>>> -Dcassandra-pidfile=/var/run/cassandra.pid ];
> >> >>>>>>
> >> >>>>>> ManagementSpecVersion = 1.2;
> >> >>>>>>
> >> >>>>>> SpecName = Java Virtual Machine Specification;
> >> >>>>>>
> >> >>>>>> SpecVendor = Sun Microsystems Inc.;
> >> >>>>>>
> >> >>>>>> SpecVersion = 1.0;
> >> >>>>>>
> >> >>>>>> StartTime = 1272911001415;
> >> >>>>>> ...
> >> >>>>>
> >> >>>
> >> >>>
> >> >>
> >> >>
> >> >
> >> >
> >> >
> >> > --
> >> > Jonathan Ellis
> >> > Project Chair, Apache Cassandra
> >> > co-founder of Riptano, the source for professional Cassandra support
> >> > http://riptano.com
> >>
> >
> >
> 
> 
> 
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>  


RE: performance tuning - where does the slowness come from?

Posted by Mark Jones <MJ...@imagehawk.com>.
~70 million keys (20 bytes each, using RandomPartitioner) comes to about 1.4GB of key data plus the structures to support it, which seems a good bit smaller than the 32GB of RAM available across the 4 machines.  How many machines should it take to get 2-3000 lookups/second?

From: Brandon Williams [mailto:driftx@gmail.com]
Sent: Wednesday, May 05, 2010 7:04 PM
To: user@cassandra.apache.org
Subject: Re: performance tuning - where does the slowness come from?

On Wed, May 5, 2010 at 6:59 PM, Mark Jones <MJ...@imagehawk.com> wrote:
My data is single row/key to a 500 byte column and I'm reading ALL random keys (worst case read scenario)  Cache has minimal effectiveness, so the Bloom trees and indexes are getting a real work out.  I'm on 8GB Ubuntu 9.10 boxes (64bit).  Yea, I was griping about the performance earlier, disk is heavily used by Cassandra, so outside of going to some highend SAS stuff, not sure what to do.

How many keys?  If your data size is exceeding your OS's cache capacity (8GB - JVM size) then a completely random read pattern is mostly going to test how fast your disk can seek.  You can try to use faster disks, but the better solution is to add more nodes.

-Brandon

Re: performance tuning - where does the slowness come from?

Posted by Brandon Williams <dr...@gmail.com>.
On Wed, May 5, 2010 at 6:59 PM, Mark Jones <MJ...@imagehawk.com> wrote:

>  My data is single row/key to a 500 byte column and I’m reading ALL random
> keys (worst case read scenario)  Cache has minimal effectiveness, so the
> Bloom trees and indexes are getting a real work out.  I’m on 8GB Ubuntu 9.10
> boxes (64bit).  Yea, I was griping about the performance earlier, disk is
> heavily used by Cassandra, so outside of going to some highend SAS stuff,
> not sure what to do.
>

How many keys?  If your data size is exceeding your OS's cache capacity (8GB
- JVM size) then a completely random read pattern is mostly going to test
how fast your disk can seek.  You can try to use faster disks, but the
better solution is to add more nodes.
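
A rough back-of-envelope with the numbers from this thread (70M keys, ~500-byte columns, 8GB boxes; the ~4GB heap and the ~100 seeks/sec per disk are assumptions, not measurements):

    data:   70M rows x (~20 B key + ~500 B column + overhead)  ~ 35-50 GB before replication
    cache:  4 nodes x (8 GB RAM - ~4 GB JVM heap)               ~ 16 GB of page cache in total
    disk:   ~100-150 random seeks/sec per SATA spindle          ~ a few hundred cold reads/sec per node

So with fully random keys most reads miss the page cache and pay at least one seek, and 2-3000 cold lookups/sec means either on the order of ten spindles' worth of nodes/disks, or enough RAM across the cluster to keep the working set cached.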

-Brandon

RE: performance tuning - where does the slowness come from?

Posted by Mark Jones <MJ...@imagehawk.com>.
My data is a single row/key mapping to one 500-byte column, and I'm reading ALL random keys (worst-case read scenario).  The cache has minimal effectiveness, so the Bloom filters and indexes are getting a real workout.  I'm on 8GB Ubuntu 9.10 boxes (64-bit).  Yeah, I was griping about the performance earlier; disk is heavily used by Cassandra, so outside of going to some high-end SAS stuff, I'm not sure what to do.

From: Brandon Williams [mailto:driftx@gmail.com]
Sent: Wednesday, May 05, 2010 6:47 PM
To: user@cassandra.apache.org
Subject: Re: performance tuning - where does the slowness come from?

On Wed, May 5, 2010 at 6:36 PM, Mark Jones <MJ...@imagehawk.com> wrote:
Have you actually managed to get 10K reads/second, or are you just estimating that you can?  I've run into similar issues, but I never got reads to scale when searching for unique keys even using 40 threads, I did discover that using 80+ threads, I can actually reduce performance.  I've never gotten more than 200-300 reads/second (steady state) off a 4 cluster node.  I can get roughly 8K writes/second to the same cluster (although I haven't tested both simultaneously with results worth talking about).

200-300/s is pretty low, especially if you can get 8k writes/s.  I would estimate that you should be getting at least a couple thousand reads/s.  Here is what I've gotten from a quad core machine with two disks:

http://racklabs.com/~bwilliam/cassandra/04vs05vs06.png

How much data are you testing with?

-Brandon

Re: performance tuning - where does the slowness come from?

Posted by Brandon Williams <dr...@gmail.com>.
On Wed, May 5, 2010 at 6:36 PM, Mark Jones <MJ...@imagehawk.com> wrote:

>  Have you actually managed to get 10K reads/second, or are you just
> estimating that you can?  I’ve run into similar issues, but I never got
> reads to scale when searching for unique keys even using 40 threads, I did
> discover that using 80+ threads, I can actually reduce performance.  I’ve
> never gotten more than 200-300 reads/second (steady state) off a 4 cluster
> node.  I can get roughly 8K writes/second to the same cluster (although I
> haven’t tested both simultaneously with results worth talking about).
>

200-300/s is pretty low, especially if you can get 8k writes/s.  I would
estimate that you should be getting at least a couple thousand reads/s.
 Here is what I've gotten from a quad core machine with two disks:

http://racklabs.com/~bwilliam/cassandra/04vs05vs06.png

How much data are you testing with?

-Brandon

RE: performance tuning - where does the slowness come from?

Posted by Mark Jones <MJ...@imagehawk.com>.
Have you actually managed to get 10K reads/second, or are you just estimating that you can?  I've run into similar issues: I never got reads to scale when searching for unique keys, even using 40 threads, and I did discover that using 80+ threads can actually reduce performance.  I've never gotten more than 200-300 reads/second (steady state) off a 4-node cluster.  I can get roughly 8K writes/second to the same cluster (although I haven't tested both simultaneously with results worth talking about).

From: Ran Tavory [mailto:rantav@gmail.com]
Sent: Wednesday, May 05, 2010 4:59 PM
To: user@cassandra.apache.org
Subject: Re: performance tuning - where does the slowness come from?

let's see if I can make some assertions, feel free to correct me...

Well, obviously, reads are much slower in cassandra than writes, everyone knows that, but by which factor?
In my case I read/write only one column at a time. Key, column and value are pretty small (< 200b)
So the numbers are usually - Write Latency: ~0.05ms, Read Latency: ~30ms. So it looks like reads are 60x slower, at least on my hardware. This happens when cache is cold. If cache is warm reads are better, but unfortunately my cache is usually cold...
If my application keeps reading a cold cache there's nothing I can do to make reads faster from cassandra's side. With one client thread this implies 1000/30=33 reads/sec, not great.
However, although read latency is a bottleneck, read throughput isn't. So I need to add more reading threads and I can actually add many of them before read latency starts draining, according to ycsb I can have ~10000 reads/sec (on the cluster they tested, numbers may vary) before read latency starts draining. So, numbers may vary by cluster size, hardware, data size etc, but the idea is - if read latency is 30ms and cache is a miss most of the time, that's normal, just add more reader threads and you get a better throughput.
Sorry if this sounds trivial, I was just trying to improve on the 30ms reads until I realized I actually can't...



Re: performance tuning - where does the slowness come from?

Posted by Ran Tavory <ra...@gmail.com>.
let's see if I can make some assertions, feel free to correct me...

Well, obviously, reads are much slower in Cassandra than writes - everyone
knows that - but by what factor?
In my case I read/write only one column at a time, and key, column and value
are all pretty small (< 200 bytes).
The numbers are usually Write Latency: ~0.05ms and Read Latency: ~30ms, so
reads look about 60x slower, at least on my hardware. This happens when the
cache is cold. If the cache is warm reads are better, but unfortunately my
cache is usually cold...
If my application keeps reading a cold cache, there's nothing I can do to make
reads faster from Cassandra's side. With one client thread this implies
1000/30 = ~33 reads/sec, which isn't great.
However, although read latency is a bottleneck, read throughput isn't. So I
need to add more reading threads, and I can add many of them before read
latency starts to suffer; according to YCSB, ~10000 reads/sec were possible
(on the cluster they tested - numbers will vary) before latency degraded.
Numbers vary by cluster size, hardware, data size etc., but the idea is: if
read latency is 30ms and the cache misses most of the time, that's normal -
just add more reader threads and you get better throughput.
Sorry if this sounds trivial; I was just trying to improve on the 30ms reads
until I realized I actually can't...
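
To make the threads-vs-latency arithmetic concrete, here's a minimal, self-contained sketch in plain Java.  readOneColumn() is a hypothetical stand-in for whatever client call you use (raw Thrift, Hector, ...), and the 30ms sleep just simulates the cold-cache read latency above:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    public class ReadThroughputSketch {

        // Hypothetical stand-in for one single-column read; replace with your
        // real client call.  Sleeping 30ms simulates the cold-cache latency.
        static void readOneColumn(String key) throws InterruptedException {
            Thread.sleep(30);
        }

        public static void main(String[] args) throws Exception {
            final int threads = 32;          // try 1, 8, 32, 64 ...
            final long runMillis = 10000;    // measure for 10 seconds
            final AtomicLong reads = new AtomicLong();
            final long deadline = System.currentTimeMillis() + runMillis;

            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int t = 0; t < threads; t++) {
                pool.submit(new Runnable() {
                    public void run() {
                        long i = 0;
                        try {
                            while (System.currentTimeMillis() < deadline) {
                                readOneColumn("key-" + (i++));
                                reads.incrementAndGet();
                            }
                        } catch (InterruptedException ignored) {
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(runMillis + 5000, TimeUnit.MILLISECONDS);
            System.out.println(threads + " threads -> "
                    + (reads.get() * 1000 / runMillis) + " reads/sec");
        }
    }

With 30ms per read, one thread tops out around 33 reads/sec and 32 threads around 1000 reads/sec; throughput keeps scaling with threads until the server-side read stage or the disks saturate.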

On Wed, May 5, 2010 at 7:08 PM, Jonathan Ellis <jb...@gmail.com> wrote:

>  - your key cache isn't warm.  capacity 17M, size 0.5M, 468083 reads
> sounds like most of your reads have been for unique keys.
>  - the kind of reads you are doing can have a big effect (mostly
> number of columns you are asking for).  column index granularity plays
> a role (for non-rowcached reads); so can column comparator (see e.g.
> https://issues.apache.org/jira/browse/CASSANDRA-1043)
>  - the slow system reads are all on HH rows, which can get very wide
> (hence, slow to read the whole row, which is what the HH code does).
> clean those out either by bringing back the nodes it's hinting for, or
> just removing the HH data files.

Re: performance tuning - where does the slowness come from?

Posted by Jonathan Ellis <jb...@gmail.com>.
 - your key cache isn't warm.  capacity 17M, size 0.5M, 468083 reads
sounds like most of your reads have been for unique keys.
 - the kind of reads you are doing can have a big effect (mostly
number of columns you are asking for).  column index granularity plays
a role (for non-rowcached reads); so can column comparator (see e.g.
https://issues.apache.org/jira/browse/CASSANDRA-1043)
 - the slow system reads are all on HH rows, which can get very wide
(hence, slow to read the whole row, which is what the HH code does).
clean those out either by bringing back the nodes it's hinting for, or
just removing the HH data files.
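
To keep an eye on where the time goes while trying these suggestions, the
latency counters discussed in this thread can be polled over JMX. A minimal
sketch: the MBean and attribute names are the ones from the JMX dump Ran
posted (org.apache.cassandra.service:type=StorageProxy), the port is the
jmxremote port 9004 from his node settings, and the default host is just the
slow node from his ring output.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ReadLatencyProbe {
    public static void main(String[] args) throws Exception {
        // Host defaults to the slow node from the ring output; pass another node as arg[0].
        String host = args.length > 0 ? args[0] : "192.168.252.61";
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":9004/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
            // MBean and attribute names as they appear in the JMX dump in this thread.
            ObjectName proxy = new ObjectName("org.apache.cassandra.service:type=StorageProxy");
            System.out.println("RecentReadLatencyMicros  = "
                    + mbsc.getAttribute(proxy, "RecentReadLatencyMicros"));
            System.out.println("RecentWriteLatencyMicros = "
                    + mbsc.getAttribute(proxy, "RecentWriteLatencyMicros"));
        } finally {
            jmxc.close();
        }
    }
}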

On Wed, May 5, 2010 at 10:19 AM, Ran Tavory <ra...@gmail.com> wrote:
> I'm still trying to figure out where my slowness is coming from...
> By now I'm pretty sure it's the reads are slow, but not sure how to improve
> them.
> I'm looking at cfstats. Can you say if there are better configuration
> options? So far I've used all default settings, except for:
>     <Keyspace Name="outbrain_kvdb">
>       <ColumnFamily CompareWith="BytesType" Name="KvImpressions"
> KeysCached="50%"/>
>
>  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackAwareStrategy</ReplicaPlacementStrategy>
>       <ReplicationFactor>2</ReplicationFactor>
>
>  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
>     </Keyspace>
>
> What does a good read latency look like? I was expecting 10ms, however so
> far it seems that my KvImpressions read latency is 30ms and in the system
> keyspace I have 800ms :(
> I thought adding KeysCached="50%" would improve my situation but
> unfortunately looks like the hitrate is about 0. I realize that's
> application specific, but maybe there are other magic bullets...
> Is there something like adding cache to the system keyspace? 800 ms is
> pretty bad, isn't it?
> See stats below and thanks.
>
> Keyspace: outbrain_kvdb
>         Read Count: 651668
>         Read Latency: 34.18622328547666 ms.
>         Write Count: 655542
>         Write Latency: 0.041145092152752985 ms.
>         Pending Tasks: 0
>                 Column Family: KvImpressions
>                 SSTable count: 13
>                 Space used (live): 23304548897
>                 Space used (total): 23304548897
>                 Memtable Columns Count: 895
>                 Memtable Data Size: 2108990
>                 Memtable Switch Count: 8
>                 Read Count: 468083
>                 Read Latency: 151.603 ms.
>                 Write Count: 552566
>                 Write Latency: 0.023 ms.
>                 Pending Tasks: 0
>                 Key cache capacity: 17398656
>                 Key cache size: 567967
>                 Key cache hit rate: 0.0
>                 Row cache: disabled
>                 Compacted row minimum size: 269
>                 Compacted row maximum size: 54501
>                 Compacted row mean size: 933
> ...
> ----------------
> Keyspace: system
>         Read Count: 1151
>         Read Latency: 872.5014448305822 ms.
>         Write Count: 51215
>         Write Latency: 0.07156788050375866 ms.
>         Pending Tasks: 0
>                 Column Family: HintsColumnFamily
>                 SSTable count: 5
>                 Space used (live): 437366878
>                 Space used (total): 437366878
>                 Memtable Columns Count: 14987
>                 Memtable Data Size: 87975
>                 Memtable Switch Count: 2
>                 Read Count: 1150
>                 Read Latency: NaN ms.
>                 Write Count: 51211
>                 Write Latency: 0.027 ms.
>                 Pending Tasks: 0
>                 Key cache capacity: 6
>                 Key cache size: 4
>                 Key cache hit rate: NaN
>                 Row cache: disabled
>                 Compacted row minimum size: 0
>                 Compacted row maximum size: 0
>                 Compacted row mean size: 0
>                 Column Family: LocationInfo
>                 SSTable count: 2
>                 Space used (live): 3504
>                 Space used (total): 3504
>                 Memtable Columns Count: 0
>                 Memtable Data Size: 0
>                 Memtable Switch Count: 1
>                 Read Count: 1
>                 Read Latency: NaN ms.
>                 Write Count: 7
>                 Write Latency: NaN ms.
>                 Pending Tasks: 0
>                 Key cache capacity: 2
>                 Key cache size: 1
>                 Key cache hit rate: NaN
>                 Row cache: disabled
>                 Compacted row minimum size: 0
>                 Compacted row maximum size: 0
>                 Compacted row mean size: 0



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: performance tuning - where does the slowness come from?

Posted by Vick Khera <vi...@khera.org>.
On Thu, May 6, 2010 at 2:05 PM, Weijun Li <we...@gmail.com> wrote:
> Anyway, for mmap, in order for you to access the data in the buffer or
> virtual address, OS has to read/page in the data to a block of physical
> memory and assign your virtual address to that physical memory block. So if
> you use random partitioner you'll most likely force Linux to page in/out all
> the time. In this case, disabling mmap and let Cassandra to use random file
> access seems to make more sense. mmap should be used when you have enough
> ram for OS to cache most or all of your data files.
>

You pay the price of disk I/O and cache with or without mmap, don't
you?  If you're just reading the data, then there is no page-out
necessary.  Just mmap'ing a file does not cause it to be read in its
entirety into the cache.
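
A minimal sketch of that behaviour in Java, in case it helps to see it
outside Cassandra: FileChannel.map() only reserves address space, and pages
are faulted in from disk as the buffer is touched. The file path below is a
placeholder, point it at any large file.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : "/tmp/some-large-file";
        try (RandomAccessFile raf = new RandomAccessFile(path, "r");
             FileChannel ch = raf.getChannel()) {
            // A single MappedByteBuffer is limited to 2GB, so cap the mapping size.
            long size = Math.min(ch.size(), Integer.MAX_VALUE);
            // After map() returns, VIRT has grown by `size` but nothing has been read yet.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            System.out.println("mapped " + buf.capacity() + " bytes, no I/O so far");

            // Touching one byte per 4KB page is what actually faults data in;
            // this is the page-in cost a cold read pays, mmap or not.
            long touched = 0;
            for (int pos = 0; pos < buf.capacity(); pos += 4096) {
                buf.get(pos);
                touched++;
            }
            System.out.println("faulted in roughly " + touched + " pages");
        }
    }
}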

Re: performance tuning - where does the slowness come from?

Posted by Kyusik Chung <ky...@discovereads.com>.
I'd like to add one caveat to Weijun's statement.  I agree with everything, except if your access pattern doesn't look like a random sampling of data across all your sstables.  If it turns out that at any given time you're doing many repeated hits to a smaller subset of keys, then using mmap should be OK even if your live sstables are much larger than available memory.  The key is to have enough memory available (pre-mmap) so that there are few page-in operations relative to client read requests.

Also, I suppose if you don't have a lot of repeat hits per key, mmap probably doesn't buy you much either, unless your rows are very skinny and lots of them fit in a page - as far as I can tell, Linux lazily pages in data that's been mmap-ed.

(apologies for describing mmap inaccurately earlier in the thread)
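
For reference, the "lots of them fit in a page" part can be put into rough
numbers with the cfstats posted earlier in the thread (compacted row mean
size of 933 bytes) and a typical 4KB Linux page; the page size is an
assumption about the host, not something reported in the thread.

public class RowsPerPage {
    public static void main(String[] args) {
        int pageSizeBytes = 4096;   // typical Linux page size (an assumption)
        int meanRowBytes  = 933;    // "Compacted row mean size" from cfstats
        System.out.printf("~%.1f rows per page%n", (double) pageSizeBytes / meanRowBytes);
        // ~4.4 rows/page: each page fault drags in only a few neighbouring rows,
        // which only helps if those neighbours are read soon afterwards.
    }
}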

Kyusik Chung

On May 6, 2010, at 11:05 AM, Weijun Li wrote:

> I just used Linux "Top" to see the number of virtual memory used by JVM. When you turned on mmap, this number is equal to the size of your live sstables. And if you turn off mmap the VIRT will be close to the xmx of your jvm.
> 
> Anyway, for mmap, in order for you to access the data in the buffer or virtual address, OS has to read/page in the data to a block of physical memory and assign your virtual address to that physical memory block. So if you use random partitioner you'll most likely force Linux to page in/out all the time. In this case, disabling mmap and let Cassandra to use random file access seems to make more sense. mmap should be used when you have enough ram for OS to cache most or all of your data files.
> 
> -Weijun


Re: performance tuning - where does the slowness come from?

Posted by Weijun Li <we...@gmail.com>.
I just used Linux "top" to see the amount of virtual memory used by the JVM.
When you turn on mmap, this number is roughly the size of your live
sstables, and if you turn off mmap the VIRT will be close to the xmx of your
jvm.

Anyway, with mmap, in order for you to access the data through the buffer or
virtual address, the OS has to read/page the data into a block of physical
memory and map your virtual address to that physical memory block. So if
you use the random partitioner you'll most likely force Linux to page in/out
all the time. In this case, disabling mmap and letting Cassandra use
standard file access seems to make more sense. mmap should be used when you
have enough ram for the OS to cache most or all of your data files.

-Weijun
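
Putting rough numbers on "enough ram to cache most or all of your data
files" for the cluster in this thread: about 23 GB of live sstables per node
(the "Space used (live)" figure in the cfstats above) against 8 GB of RAM
per host, of which 4 GB is already promised to the heap. A back-of-the-
envelope sketch:

public class WorkingSet {
    public static void main(String[] args) {
        long liveSstableBytes = 23_304_548_897L;          // "Space used (live)" from cfstats
        long ramBytes         = 8L * 1024 * 1024 * 1024;  // 8 GB host RAM
        long heapBytes        = 4L * 1024 * 1024 * 1024;  // -Xmx4G
        long pageCacheBytes   = ramBytes - heapBytes;     // optimistic upper bound
        System.out.printf("data is %.1fx larger than the memory left for the page cache%n",
                (double) liveSstableBytes / pageCacheBytes);
        // ~5.4x: with the random partitioner and mostly unique keys, most mmap
        // reads have to fault a page in from disk, which is the behaviour above.
    }
}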

On Thu, May 6, 2010 at 10:49 AM, Vick Khera <vi...@khera.org> wrote:

> On Thu, May 6, 2010 at 1:06 PM, Weijun Li <we...@gmail.com> wrote:
> > In this case using mmap will cause Cassandra to use sometimes > 100G
> virtual
> > memory which is much more than the physical ram, since we are using
> random
> > partitioner the OS will be busy doing swap.
>
> mmap uses the virtual address space to reference bits on the disk; it
> does *NOT* use physical or virtual memory to copy that data other than
> perhaps any disk buffer cache from reading the file (which you would
> have anyhow).  Your memory usage tools will report high memory usage
> because they tell you how much virtual address space you have
> allocated.
>

Re: performance tuning - where does the slowness come from?

Posted by Vick Khera <vi...@khera.org>.
On Thu, May 6, 2010 at 1:06 PM, Weijun Li <we...@gmail.com> wrote:
> In this case using mmap will cause Cassandra to use sometimes > 100G virtual
> memory which is much more than the physical ram, since we are using random
> partitioner the OS will be busy doing swap.

mmap uses the virtual address space to reference bits on the disk; it
does *NOT* use physical or virtual memory to copy that data other than
perhaps any disk buffer cache from reading the file (which you would
have anyhow).  Your memory usage tools will report high memory usage
because they tell you how much virtual address space you have
allocated.

Re: performance tuning - where does the slowness come from?

Posted by Artie Copeland <ye...@gmail.com>.
Weijun Li,

I also have an environment with similarly large datasets and strict latency
requirements. Can you please elaborate on the custom changes you made to
cassandra to meet these SLAs, either code or configuration? I am very
interested in learning more about the internal workings of cassandra and its
performance.

Thanx,
Artie

On Thu, May 6, 2010 at 10:06 AM, Weijun Li <we...@gmail.com> wrote:

> Our use case is a little different: our server is a typical high volume
> transaction server that processes more than half billion requests per day.
> The write/read ratio is close to 1, and the cluster needs to serve >10k
> write+read with strict latency (<20ms) otherwise the client will treat it as
> failure. Plus we have hundreds of millions of keys so the generated sstable
> files are much bigger that the ram size. In this case using mmap will cause
> Cassandra to use sometimes > 100G virtual memory which is much more than the
> physical ram, since we are using random partitioner the OS will be busy
> doing swap.
>
> I have finally customized cassandra to meet the above requirement by using
> cheap hardware (32G ram + SATA drives): one thing I learned is that you have
> to carefully avoid swapping, especially when you need to cache most of the
> keys in memory, swap can easily damage the performance of your in memory
> cache. I also made some optimizations to reduce memory/disk consumption and
> to make it easier for us to diagnose issues. In one word: cassandra is very
> well written but there's still huge potential for you to improve it to meet
> your special requirements.
>
> -Weijun
>
>
> On Wed, May 5, 2010 at 9:43 AM, Jordan Pittier <jo...@gmail.com>wrote:
>
>> I disagree. Swapping could be avoided. I don't know Cassandra internals
>> mechanisms but what I am expecting is that whenever I want to read rows that
>> are not in RAM, Cassandra load them from hard drive to RAM if space is
>> available, and, if RAM is full to reply my query without saving rows in RAM.
>> No need for swapping.
>>
>> I have no try yet to change DiskAccessMode to standard, I hope it will
>> help me.
>>
>> Another thing : please dont post your benchmark figures without any
>> explanation on the work load generator or your cluster settings. It really
>> doesn't make any sense...
>>
>>
>> On Wed, May 5, 2010 at 6:16 PM, Weijun Li <we...@gmail.com> wrote:
>>
>>> When you have much more data than you can hold in memory, it will be
>>> difficult for you to get around of swap which will most likely ruin your
>>> performance. Also in this case mmap doesn't seem to make much sense if you
>>> use random partitioner which will end up with crazy swap too. However we
>>> found a way to get around read/write performance issue by integrating
>>> memcached into Cassandra: in this case you need to ask memcached to disable
>>> disk swap so you can achieve move than 10k read+write with milli-second
>>> level of latency. Actually this is the only way that we figured out that can
>>> gracefully solve the performance and memory issue.
>>>
>>> -Weijun
>>>
>>>
>>> On Wed, May 5, 2010 at 8:19 AM, Ran Tavory <ra...@gmail.com> wrote:
>>>
>>>> I'm still trying to figure out where my slowness is coming from...
>>>> By now I'm pretty sure it's the reads are slow, but not sure how to
>>>> improve them.
>>>>
>>>> I'm looking at cfstats. Can you say if there are better configuration
>>>> options? So far I've used all default settings, except for:
>>>>
>>>>     <Keyspace Name="outbrain_kvdb">
>>>>       <ColumnFamily CompareWith="BytesType" Name="KvImpressions"
>>>> KeysCached="50%"/>
>>>>
>>>>  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackAwareStrategy</ReplicaPlacementStrategy>
>>>>       <ReplicationFactor>2</ReplicationFactor>
>>>>
>>>>  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
>>>>     </Keyspace>
>>>>
>>>>
>>>> What does a good read latency look like? I was expecting 10ms, however
>>>> so far it seems that my KvImpressions read latency is 30ms and in the system
>>>> keyspace I have 800ms :(
>>>> I thought adding KeysCached="50%" would improve my situation but
>>>> unfortunately looks like the hitrate is about 0. I realize that's
>>>> application specific, but maybe there are other magic bullets...
>>>>
>>>> Is there something like adding cache to the system keyspace? 800 ms is
>>>> pretty bad, isn't it?
>>>>
>>>> See stats below and thanks.
>>>>
>>>>
>>>> Keyspace: outbrain_kvdb
>>>>         Read Count: 651668
>>>>         Read Latency: 34.18622328547666 ms.
>>>>         Write Count: 655542
>>>>         Write Latency: 0.041145092152752985 ms.
>>>>         Pending Tasks: 0
>>>>                 Column Family: KvImpressions
>>>>                 SSTable count: 13
>>>>                 Space used (live): 23304548897
>>>>                 Space used (total): 23304548897
>>>>                 Memtable Columns Count: 895
>>>>                 Memtable Data Size: 2108990
>>>>                 Memtable Switch Count: 8
>>>>                 Read Count: 468083
>>>>                 Read Latency: 151.603 ms.
>>>>                 Write Count: 552566
>>>>                 Write Latency: 0.023 ms.
>>>>                 Pending Tasks: 0
>>>>                 Key cache capacity: 17398656
>>>>                 Key cache size: 567967
>>>>                 Key cache hit rate: 0.0
>>>>                 Row cache: disabled
>>>>                 Compacted row minimum size: 269
>>>>                 Compacted row maximum size: 54501
>>>>                 Compacted row mean size: 933
>>>> ...
>>>> ----------------
>>>> Keyspace: system
>>>>         Read Count: 1151
>>>>         Read Latency: 872.5014448305822 ms.
>>>>         Write Count: 51215
>>>>         Write Latency: 0.07156788050375866 ms.
>>>>         Pending Tasks: 0
>>>>                 Column Family: HintsColumnFamily
>>>>                 SSTable count: 5
>>>>                 Space used (live): 437366878
>>>>                 Space used (total): 437366878
>>>>                 Memtable Columns Count: 14987
>>>>                 Memtable Data Size: 87975
>>>>                 Memtable Switch Count: 2
>>>>                 Read Count: 1150
>>>>                 Read Latency: NaN ms.
>>>>                 Write Count: 51211
>>>>                 Write Latency: 0.027 ms.
>>>>                 Pending Tasks: 0
>>>>                 Key cache capacity: 6
>>>>                 Key cache size: 4
>>>>                 Key cache hit rate: NaN
>>>>                 Row cache: disabled
>>>>                 Compacted row minimum size: 0
>>>>                 Compacted row maximum size: 0
>>>>                 Compacted row mean size: 0
>>>>
>>>>                 Column Family: LocationInfo
>>>>                 SSTable count: 2
>>>>                 Space used (live): 3504
>>>>                 Space used (total): 3504
>>>>                 Memtable Columns Count: 0
>>>>                 Memtable Data Size: 0
>>>>                 Memtable Switch Count: 1
>>>>                 Read Count: 1
>>>>                 Read Latency: NaN ms.
>>>>                 Write Count: 7
>>>>                 Write Latency: NaN ms.
>>>>                 Pending Tasks: 0
>>>>                 Key cache capacity: 2
>>>>                 Key cache size: 1
>>>>                 Key cache hit rate: NaN
>>>>                 Row cache: disabled
>>>>                 Compacted row minimum size: 0
>>>>                 Compacted row maximum size: 0
>>>>                 Compacted row mean size: 0
>>>>
>>>>
>>>> On Tue, May 4, 2010 at 10:57 PM, Kyusik Chung <ky...@discovereads.com>wrote:
>>>>
>>>>> Im using Ubuntu 8.04 on 64 bit hosts on rackspace cloud.
>>>>>
>>>>> Im in the middle of repeating some perf tests, but so far, I get
>>>>> as-good or slightly better read perf by using standard disk access mode vs
>>>>> mmap.  So far consecutive tests are returning consistent numbers.
>>>>>
>>>>> Im not sure how to explain it...maybe its an ubuntu 8.04 issue with
>>>>> mmap.  Back when I was using mmap, I was definitely seeing the kswapd0
>>>>> process start using cpu as the box ran out of memory, and read performance
>>>>> significantly degraded.
>>>>>
>>>>> Next, Ill run some tests with mmap_index_only, and Ill test with heavy
>>>>> concurrent writes as well as reads.  Ill let everyone know what I find.
>>>>>
>>>>> Kyusik Chung
>>>>> CEO, Discovereads.com
>>>>> kyusik@discovereads.com
>>>>>
>>>>> On May 4, 2010, at 12:27 PM, Jonathan Ellis wrote:
>>>>>
>>>>> > Are you using 32 bit hosts?  If not don't be scared of mmap using a
>>>>> > lot of address space, you have plenty.  It won't make you swap more
>>>>> > than using buffered i/o.
>>>>> >
>>>>> > On Tue, May 4, 2010 at 1:57 PM, Ran Tavory <ra...@gmail.com> wrote:
>>>>> >> I canceled mmap and indeed memory usage is sane again. So far
>>>>> performance
>>>>> >> hasn't been great, but I'll wait and see.
>>>>> >> I'm also interested in a way to cap mmap so I can take advantage of
>>>>> it but
>>>>> >> not swap the host to death...
>>>>> >>
>>>>> >> On Tue, May 4, 2010 at 9:38 PM, Kyusik Chung <
>>>>> kyusik@discovereads.com>
>>>>> >> wrote:
>>>>> >>>
>>>>> >>> This sounds just like the slowness I was asking about in another
>>>>> thread -
>>>>> >>> after a lot of reads, the machine uses up all available memory on
>>>>> the box
>>>>> >>> and then starts swapping.
>>>>> >>> My understanding was that mmap helps greatly with read and write
>>>>> perf
>>>>> >>> (until the box starts swapping I guess)...is there any way to use
>>>>> mmap and
>>>>> >>> cap how much memory it takes up?
>>>>> >>> What do people use in production?  mmap or no mmap?
>>>>> >>> Thanks!
>>>>> >>> Kyusik Chung
>>>>> >>> On May 4, 2010, at 10:11 AM, Schubert Zhang wrote:
>>>>> >>>
>>>>> >>> 1. When initially startup your nodes, please plan your InitialToken
>>>>> of
>>>>> >>> each node evenly.
>>>>> >>> 2. <DiskAccessMode>standard</DiskAccessMode>
>>>>> >>>
>>>>> >>> On Tue, May 4, 2010 at 9:09 PM, Boris Shulman <sh...@gmail.com>
>>>>> wrote:
>>>>> >>>>
>>>>> >>>> I think that the extra (more than 4GB) memory usage comes from the
>>>>> >>>> mmaped io, that is why it happens only for reads.
>>>>> >>>>
>>>>> >>>> On Tue, May 4, 2010 at 2:02 PM, Jordan Pittier <
>>>>> jordan.pittier@gmail.com>
>>>>> >>>> wrote:
>>>>> >>>>> I'm facing the same issue with swap. It only occurs when I
>>>>> perform read
>>>>> >>>>> operations (write are very fast :)). So I can't help you with the
>>>>> >>>>> memory
>>>>> >>>>> probleme.
>>>>> >>>>>
>>>>> >>>>> But to balance the load evenly between nodes in cluster just
>>>>> manually
>>>>> >>>>> fix
>>>>> >>>>> their token.(the "formula" is i * 2^127 / nb_nodes).
>>>>> >>>>>
>>>>> >>>>> Jordzn
>>>>> >>>>>
>>>>> >>>>> On Tue, May 4, 2010 at 8:20 AM, Ran Tavory <ra...@gmail.com>
>>>>> wrote:
>>>>> >>>>>>
>>>>> >>>>>> I'm looking into performance issues on a 0.6.1 cluster. I see
>>>>> two
>>>>> >>>>>> symptoms:
>>>>> >>>>>> 1. Reads and writes are slow
>>>>> >>>>>> 2. One of the hosts is doing a lot of GC.
>>>>> >>>>>> 1 is slow in the sense that in normal state the cluster used to
>>>>> make
>>>>> >>>>>> around 3-5k read and writes per second (6-10k operations per
>>>>> second),
>>>>> >>>>>> but
>>>>> >>>>>> how it's in the order of 200-400 ops per second, sometimes even
>>>>> less.
>>>>> >>>>>> 2 looks like this:
>>>>> >>>>>> $ tail -f /outbrain/cassandra/log/system.log
>>>>> >>>>>>  INFO [GC inspection] 2010-05-04 00:42:18,636 GCInspector.java
>>>>> (line
>>>>> >>>>>> 110)
>>>>> >>>>>> GC for ParNew: 672 ms, 166482384 reclaimed leaving 2872087208
>>>>> used;
>>>>> >>>>>> max is
>>>>> >>>>>> 4432068608
>>>>> >>>>>>  INFO [GC inspection] 2010-05-04 00:42:28,638 GCInspector.java
>>>>> (line
>>>>> >>>>>> 110)
>>>>> >>>>>> GC for ParNew: 498 ms, 166493352 reclaimed leaving 2836049448
>>>>> used;
>>>>> >>>>>> max is
>>>>> >>>>>> 4432068608
>>>>> >>>>>>  INFO [GC inspection] 2010-05-04 00:42:38,640 GCInspector.java
>>>>> (line
>>>>> >>>>>> 110)
>>>>> >>>>>> GC for ParNew: 327 ms, 166091528 reclaimed leaving 2796888424
>>>>> used;
>>>>> >>>>>> max is
>>>>> >>>>>> 4432068608
>>>>> >>>>>> ... and it goes on and on for hours, no stopping...
>>>>> >>>>>> The cluster is made of 6 hosts, 3 in one DC and 3 in another.
>>>>> >>>>>> Each host has 8G RAM.
>>>>> >>>>>> -Xmx=4G
>>>>> >>>>>> For some reason, the load isn't distributed evenly b/w the
>>>>> hosts,
>>>>> >>>>>> although
>>>>> >>>>>> I'm not sure this is the cause for slowness
>>>>> >>>>>> $ nodetool -h localhost -p 9004 ring
>>>>> >>>>>> Address       Status     Load          Range
>>>>> >>>>>>        Ring
>>>>> >>>>>>
>>>>> >>>>>> 144413773383729447702215082383444206680
>>>>> >>>>>> 192.168.252.99Up         15.94 GB
>>>>> >>>>>>  66002764663998929243644931915471302076     |<--|
>>>>> >>>>>> 192.168.254.57Up         19.84 GB
>>>>> >>>>>>  81288739225600737067856268063987022738     |   ^
>>>>> >>>>>> 192.168.254.58Up         973.78 MB
>>>>> >>>>>> 86999744104066390588161689990810839743     v   |
>>>>> >>>>>> 192.168.252.62Up         5.18 GB
>>>>> >>>>>> 88308919879653155454332084719458267849     |   ^
>>>>> >>>>>> 192.168.254.59Up         10.57 GB
>>>>> >>>>>>  142482163220375328195837946953175033937    v   |
>>>>> >>>>>> 192.168.252.61Up         11.36 GB
>>>>> >>>>>>  144413773383729447702215082383444206680    |-->|
>>>>> >>>>>> The slow host is 192.168.252.61 and it isn't the most loaded
>>>>> one.
>>>>> >>>>>> The host is waiting a lot on IO and the load average is usually
>>>>> 6-7
>>>>> >>>>>> $ w
>>>>> >>>>>>  00:42:56 up 11 days, 13:22,  1 user,  load average: 6.21, 5.52,
>>>>> 3.93
>>>>> >>>>>> $ vmstat 5
>>>>> >>>>>> procs -----------memory---------- ---swap-- -----io----
>>>>> --system--
>>>>> >>>>>> -----cpu------
>>>>> >>>>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in
>>>>> cs us
>>>>> >>>>>> sy id
>>>>> >>>>>> wa st
>>>>> >>>>>>  0  8 2147844  45744   1816 4457384    6    5    66    32    5
>>>>>  2  1
>>>>> >>>>>>  1
>>>>> >>>>>> 96  2  0
>>>>> >>>>>>  0  8 2147164  49020   1808 4451596  385    0  2345    58 3372
>>>>> 9957  2
>>>>> >>>>>>  2
>>>>> >>>>>> 78 18  0
>>>>> >>>>>>  0  3 2146432  45704   1812 4453956  342    0  2274   108 3937
>>>>> 10732
>>>>> >>>>>>  2  2
>>>>> >>>>>> 78 19  0
>>>>> >>>>>>  0  1 2146252  44696   1804 4453436  345  164  1939   294 3647
>>>>> 7833  2
>>>>> >>>>>>  2
>>>>> >>>>>> 78 18  0
>>>>> >>>>>>  0  1 2145960  46924   1744 4451260  158    0  2423   122 4354
>>>>> 14597
>>>>> >>>>>>  2  2
>>>>> >>>>>> 77 18  0
>>>>> >>>>>>  7  1 2138344  44676    952 4504148 1722  403  1722   406 1388
>>>>>  439 87
>>>>> >>>>>>  0
>>>>> >>>>>> 10  2  0
>>>>> >>>>>>  7  2 2137248  45652    956 4499436 1384  655  1384   658 1356
>>>>>  392 87
>>>>> >>>>>>  0
>>>>> >>>>>> 10  3  0
>>>>> >>>>>>  7  1 2135976  46764    956 4495020 1366  718  1366   718 1395
>>>>>  380 87
>>>>> >>>>>>  0
>>>>> >>>>>>  9  4  0
>>>>> >>>>>>  0  8 2134484  46964    956 4489420 1673  555  1814   586 1601
>>>>> 215590
>>>>> >>>>>> 14
>>>>> >>>>>>  2 68 16  0
>>>>> >>>>>>  0  1 2135388  47444    972 4488516  785  833  2390   995 3812
>>>>> 8305  2
>>>>> >>>>>>  2
>>>>> >>>>>> 77 20  0
>>>>> >>>>>>  0 10 2135164  45928    980 4488796  788  543  2275   626 36
>>>>> >>>>>> So, the host is swapping like crazy...
>>>>> >>>>>> top shows that it's using a lot of memory. As noted before
>>>>> -Xmx=4G and
>>>>> >>>>>> nothing else seems to be using a lot of memory on the host
>>>>> except for
>>>>> >>>>>> the
>>>>> >>>>>> cassandra process, however, of the 8G ram on the host, 92% is
>>>>> used by
>>>>> >>>>>> cassandra. How's that?
>>>>> >>>>>> Top shows there's 3.9g Shared and 7.2g Resident and 15.9g
>>>>> Virtual. Why
>>>>> >>>>>> does it have 15g virtual? And why 7.2 RES? This can explain the
>>>>> >>>>>> slowness in
>>>>> >>>>>> swapping.
>>>>> >>>>>> $ top
>>>>> >>>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
>>>>>  COMMAND
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>> 20281 cassandr  25   0 15.9g 7.2g 3.9g S 33.3 92.6 175:30.27
>>>>> java
>>>>> >>>>>> So, can the total memory be controlled?
>>>>> >>>>>> Or perhaps I'm looking in the wrong direction...
>>>>> >>>>>> I've looked at all the cassandra JMX counts and nothing seemed
>>>>> >>>>>> suspicious
>>>>> >>>>>> so far. By suspicious i mean a large number of pending tasks -
>>>>> there
>>>>> >>>>>> were
>>>>> >>>>>> always very small numbers in each pool.
>>>>> >>>>>> About read and write latencies, I'm not sure what the normal
>>>>> state is,
>>>>> >>>>>> but
>>>>> >>>>>> here's an example of what I see on the problematic host:
>>>>> >>>>>> #mbean = org.apache.cassandra.service:type=StorageProxy:
>>>>> >>>>>> RecentReadLatencyMicros = 30105.888180684495;
>>>>> >>>>>> TotalReadLatencyMicros = 78543052801;
>>>>> >>>>>> TotalWriteLatencyMicros = 4213118609;
>>>>> >>>>>> RecentWriteLatencyMicros = 1444.4809201925639;
>>>>> >>>>>> ReadOperations = 4779553;
>>>>> >>>>>> RangeOperations = 0;
>>>>> >>>>>> TotalRangeLatencyMicros = 0;
>>>>> >>>>>> RecentRangeLatencyMicros = NaN;
>>>>> >>>>>> WriteOperations = 4740093;
>>>>> >>>>>> And the only pool that I do see some pending tasks is the
>>>>> >>>>>> ROW-READ-STAGE,
>>>>> >>>>>> but it doesn't look like much, usually around 6-8:
>>>>> >>>>>> #mbean = org.apache.cassandra.concurrent:type=ROW-READ-STAGE:
>>>>> >>>>>> ActiveCount = 8;
>>>>> >>>>>> PendingTasks = 8;
>>>>> >>>>>> CompletedTasks = 5427955;
>>>>> >>>>>> Any help finding the solution is appreciated, thanks...
>>>>> >>>>>> Below are a few more JMXes I collected from the system that may
>>>>> be
>>>>> >>>>>> interesting.
>>>>> >>>>>> #mbean = java.lang:type=Memory:
>>>>> >>>>>> Verbose = false;
>>>>> >>>>>> HeapMemoryUsage = {
>>>>> >>>>>>   committed = 3767279616;
>>>>> >>>>>>   init = 134217728;
>>>>> >>>>>>   max = 4293656576;
>>>>> >>>>>>   used = 1237105080;
>>>>> >>>>>>  };
>>>>> >>>>>> NonHeapMemoryUsage = {
>>>>> >>>>>>   committed = 35061760;
>>>>> >>>>>>   init = 24313856;
>>>>> >>>>>>   max = 138412032;
>>>>> >>>>>>   used = 23151320;
>>>>> >>>>>>  };
>>>>> >>>>>> ObjectPendingFinalizationCount = 0;
>>>>> >>>>>> #mbean = java.lang:name=ParNew,type=GarbageCollector:
>>>>> >>>>>> LastGcInfo = {
>>>>> >>>>>>   GcThreadCount = 11;
>>>>> >>>>>>   duration = 136;
>>>>> >>>>>>   endTime = 42219272;
>>>>> >>>>>>   id = 11719;
>>>>> >>>>>>   memoryUsageAfterGc = {
>>>>> >>>>>>     ( CMS Perm Gen ) = {
>>>>> >>>>>>       key = CMS Perm Gen;
>>>>> >>>>>>       value = {
>>>>> >>>>>>         committed = 29229056;
>>>>> >>>>>>         init = 21757952;
>>>>> >>>>>>         max = 88080384;
>>>>> >>>>>>         used = 17648848;
>>>>> >>>>>>        };
>>>>> >>>>>>      };
>>>>> >>>>>>     ( Code Cache ) = {
>>>>> >>>>>>       key = Code Cache;
>>>>> >>>>>>       value = {
>>>>> >>>>>>         committed = 5832704;
>>>>> >>>>>>         init = 2555904;
>>>>> >>>>>>         max = 50331648;
>>>>> >>>>>>         used = 5563520;
>>>>> >>>>>>        };
>>>>> >>>>>>      };
>>>>> >>>>>>     ( CMS Old Gen ) = {
>>>>> >>>>>>       key = CMS Old Gen;
>>>>> >>>>>>       value = {
>>>>> >>>>>>         committed = 3594133504;
>>>>> >>>>>>         init = 112459776;
>>>>> >>>>>>         max = 4120510464;
>>>>> >>>>>>         used = 964565720;
>>>>> >>>>>>        };
>>>>> >>>>>>      };
>>>>> >>>>>>     ( Par Eden Space ) = {
>>>>> >>>>>>       key = Par Eden Space;
>>>>> >>>>>>       value = {
>>>>> >>>>>>         committed = 171835392;
>>>>> >>>>>>         init = 21495808;
>>>>> >>>>>>         max = 171835392;
>>>>> >>>>>>         used = 0;
>>>>> >>>>>>        };
>>>>> >>>>>>      };
>>>>> >>>>>>     ( Par Survivor Space ) = {
>>>>> >>>>>>       key = Par Survivor Space;
>>>>> >>>>>>       value = {
>>>>> >>>>>>         committed = 1310720;
>>>>> >>>>>>         init = 131072;
>>>>> >>>>>>         max = 1310720;
>>>>> >>>>>>         used = 0;
>>>>> >>>>>>        };
>>>>> >>>>>>      };
>>>>> >>>>>>    };
>>>>> >>>>>>   memoryUsageBeforeGc = {
>>>>> >>>>>>     ( CMS Perm Gen ) = {
>>>>> >>>>>>       key = CMS Perm Gen;
>>>>> >>>>>>       value = {
>>>>> >>>>>>         committed = 29229056;
>>>>> >>>>>>         init = 21757952;
>>>>> >>>>>>         max = 88080384;
>>>>> >>>>>>         used = 17648848;
>>>>> >>>>>>        };
>>>>> >>>>>>      };
>>>>> >>>>>>     ( Code Cache ) = {
>>>>> >>>>>>       key = Code Cache;
>>>>> >>>>>>       value = {
>>>>> >>>>>>         committed = 5832704;
>>>>> >>>>>>         init = 2555904;
>>>>> >>>>>>         max = 50331648;
>>>>> >>>>>>         used = 5563520;
>>>>> >>>>>>        };
>>>>> >>>>>>      };
>>>>> >>>>>>     ( CMS Old Gen ) = {
>>>>> >>>>>>       key = CMS Old Gen;
>>>>> >>>>>>       value = {
>>>>> >>>>>>         committed = 3594133504;
>>>>> >>>>>>         init = 112459776;
>>>>> >>>>>>         max = 4120510464;
>>>>> >>>>>>         used = 959221872;
>>>>> >>>>>>        };
>>>>> >>>>>>      };
>>>>> >>>>>>     ( Par Eden Space ) = {
>>>>> >>>>>>       key = Par Eden Space;
>>>>> >>>>>>       value = {
>>>>> >>>>>>         committed = 171835392;
>>>>> >>>>>>         init = 21495808;
>>>>> >>>>>>         max = 171835392;
>>>>> >>>>>>         used = 171835392;
>>>>> >>>>>>        };
>>>>> >>>>>>      };
>>>>> >>>>>>     ( Par Survivor Space ) = {
>>>>> >>>>>>       key = Par Survivor Space;
>>>>> >>>>>>       value = {
>>>>> >>>>>>         committed = 1310720;
>>>>> >>>>>>         init = 131072;
>>>>> >>>>>>         max = 1310720;
>>>>> >>>>>>         used = 0;
>>>>> >>>>>>        };
>>>>> >>>>>>      };
>>>>> >>>>>>    };
>>>>> >>>>>>   startTime = 42219136;
>>>>> >>>>>>  };
>>>>> >>>>>> CollectionCount = 11720;
>>>>> >>>>>> CollectionTime = 4561730;
>>>>> >>>>>> Name = ParNew;
>>>>> >>>>>> Valid = true;
>>>>> >>>>>> MemoryPoolNames = [ Par Eden Space, Par Survivor Space ];
>>>>> >>>>>> #mbean = java.lang:type=OperatingSystem:
>>>>> >>>>>> MaxFileDescriptorCount = 63536;
>>>>> >>>>>> OpenFileDescriptorCount = 75;
>>>>> >>>>>> CommittedVirtualMemorySize = 17787711488;
>>>>> >>>>>> FreePhysicalMemorySize = 45522944;
>>>>> >>>>>> FreeSwapSpaceSize = 2123968512;
>>>>> >>>>>> ProcessCpuTime = 12251460000000;
>>>>> >>>>>> TotalPhysicalMemorySize = 8364417024;
>>>>> >>>>>> TotalSwapSpaceSize = 4294959104;
>>>>> >>>>>> Name = Linux;
>>>>> >>>>>> AvailableProcessors = 8;
>>>>> >>>>>> Arch = amd64;
>>>>> >>>>>> SystemLoadAverage = 4.36;
>>>>> >>>>>> Version = 2.6.18-164.15.1.el5;
>>>>> >>>>>> #mbean = java.lang:type=Runtime:
>>>>> >>>>>> Name = 20281@ob1061.nydc1.outbrain.com;
>>>>> >>>>>>
>>>>> >>>>>> ClassPath =
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> /outbrain/cassandra/apache-cassandra-0.6.1/bin/../conf:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../build/classes:/outbrain/cassandra/apache-cassandra-0.6.1/bin/..
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> /lib/antlr-3.1.3.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/apache-cassandra-0.6.1.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/avro-1.2.0-dev.jar:/outb
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> rain/cassandra/apache-cassandra-0.6.1/bin/../lib/clhm-production.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/commons-cli-1.1.jar:/outbrain/cassandra/apache-cassandra-
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> 0.6.1/bin/../lib/commons-codec-1.2.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/commons-collections-3.2.1.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/com
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> mons-lang-2.4.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/google-collections-1.0.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/hadoop-core-0.20.1.jar:/out
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> brain/cassandra/apache-cassandra-0.6.1/bin/../lib/high-scale-lib.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/ivy-2.1.0.jar:/outbrain/cassandra/apache-cassandra-0.6.1/
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> bin/../lib/jackson-core-asl-1.4.0.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/jackson-mapper-asl-1.4.0.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/jline
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> -0.9.94.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/json-simple-1.1.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/libthrift-r917130.jar:/outbrain/cassandr
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> a/apache-cassandra-0.6.1/bin/../lib/log4j-1.2.14.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/slf4j-api-1.5.8.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib
>>>>> >>>>>> /slf4j-log4j12-1.5.8.jar;
>>>>> >>>>>>
>>>>> >>>>>> BootClassPath =
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> /usr/java/jdk1.6.0_17/jre/lib/alt-rt.jar:/usr/java/jdk1.6.0_17/jre/lib/resources.jar:/usr/java/jdk1.6.0_17/jre/lib/rt.jar:/usr/java/jdk1.6.0_17/jre/lib/sunrsasign.j
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> ar:/usr/java/jdk1.6.0_17/jre/lib/jsse.jar:/usr/java/jdk1.6.0_17/jre/lib/jce.jar:/usr/java/jdk1.6.0_17/jre/lib/charsets.jar:/usr/java/jdk1.6.0_17/jre/classes;
>>>>> >>>>>>
>>>>> >>>>>> LibraryPath =
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> /usr/java/jdk1.6.0_17/jre/lib/amd64/server:/usr/java/jdk1.6.0_17/jre/lib/amd64:/usr/java/jdk1.6.0_17/jre/../lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib;
>>>>> >>>>>>
>>>>> >>>>>> VmName = Java HotSpot(TM) 64-Bit Server VM;
>>>>> >>>>>>
>>>>> >>>>>> VmVendor = Sun Microsystems Inc.;
>>>>> >>>>>>
>>>>> >>>>>> VmVersion = 14.3-b01;
>>>>> >>>>>>
>>>>> >>>>>> BootClassPathSupported = true;
>>>>> >>>>>>
>>>>> >>>>>> InputArguments = [ -ea, -Xms128M, -Xmx4G,
>>>>> -XX:TargetSurvivorRatio=90,
>>>>> >>>>>> -XX:+AggressiveOpts, -XX:+UseParNewGC, -XX:+UseConcMarkSweepGC,
>>>>> >>>>>> -XX:+CMSParallelRemarkEnabled, -XX:+HeapDumpOnOutOfMemoryError,
>>>>> >>>>>> -XX:SurvivorRatio=128, -XX:MaxTenuringThreshold=0,
>>>>> >>>>>> -Dcom.sun.management.jmxremote.port=9004,
>>>>> >>>>>> -Dcom.sun.management.jmxremote.ssl=false,
>>>>> >>>>>> -Dcom.sun.management.jmxremote.authenticate=false,
>>>>> >>>>>>
>>>>> >>>>>>
>>>>> -Dstorage-config=/outbrain/cassandra/apache-cassandra-0.6.1/bin/../conf,
>>>>> >>>>>> -Dcassandra-pidfile=/var/run/cassandra.pid ];
>>>>> >>>>>>
>>>>> >>>>>> ManagementSpecVersion = 1.2;
>>>>> >>>>>>
>>>>> >>>>>> SpecName = Java Virtual Machine Specification;
>>>>> >>>>>>
>>>>> >>>>>> SpecVendor = Sun Microsystems Inc.;
>>>>> >>>>>>
>>>>> >>>>>> SpecVersion = 1.0;
>>>>> >>>>>>
>>>>> >>>>>> StartTime = 1272911001415;
>>>>> >>>>>> ...
>>>>> >>>>>
>>>>> >>>
>>>>> >>>
>>>>> >>
>>>>> >>
>>>>> >
>>>>> >
>>>>> >
>>>>> > --
>>>>> > Jonathan Ellis
>>>>> > Project Chair, Apache Cassandra
>>>>> > co-founder of Riptano, the source for professional Cassandra support
>>>>> > http://riptano.com
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: performance tuning - where does the slowness come from?

Posted by Weijun Li <we...@gmail.com>.
Our use case is a little different: our server is a typical high-volume
transaction server that processes more than half a billion requests per day.
The write/read ratio is close to 1, and the cluster needs to serve >10k
writes+reads per second within a strict latency bound (<20ms), otherwise the
client treats the request as a failure. We also have hundreds of millions of
keys, so the generated sstable files are much bigger than the RAM size. In
this case mmap can cause Cassandra to use more than 100G of virtual memory,
far more than the physical RAM, and since we are using the random
partitioner the OS ends up busy swapping.
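
A rough back-of-the-envelope model of that last point (the 100G and 32G
figures are the ones above; the uniform-access assumption is a
simplification, not a measurement):

# Reads spread uniformly over the mmap'd sstable data, roughly what a
# random partitioner gives you, with the page cache bounded by physical
# RAM. Every miss becomes a major page fault against disk.
mapped_bytes = 100 * 2**30   # ~100 GB of mmap'd sstables
ram_bytes    = 32 * 2**30    # 32 GB of physical RAM

hit_rate = min(1.0, float(ram_bytes) / mapped_bytes)
print("data / RAM oversubscription: %.1fx" % (float(mapped_bytes) / ram_bytes))
print("expected page-cache hit rate under uniform access: %.0f%%" % (hit_rate * 100))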

I have finally customized Cassandra to meet the above requirements on cheap
hardware (32G RAM + SATA drives). One thing I learned is that you have to
carefully avoid swapping, especially when you need to cache most of the keys
in memory: swap can easily ruin the performance of your in-memory cache. I
also made some optimizations to reduce memory/disk consumption and to make
it easier for us to diagnose issues. In short: Cassandra is very well
written, but there is still plenty of room to improve it to meet your own
requirements.

-Weijun
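
The patch itself isn't shown in this thread, but the general read-through
shape of a memcached-in-front-of-Cassandra integration looks roughly like
the sketch below (Python, using the python-memcached client;
fetch_from_cassandra() and the TTL are placeholders, not anything from
Weijun's actual change):

import memcache  # python-memcached client

# memcached instance configured so its memory is never swapped out,
# as described earlier in the thread.
mc = memcache.Client(["127.0.0.1:11211"])

CACHE_TTL_SECONDS = 300  # assumed TTL; tune to your staleness tolerance

def fetch_from_cassandra(key):
    # Placeholder: replace with your real Thrift/client read of the row.
    raise NotImplementedError

def read_through(key):
    value = mc.get(key)                  # 1. try the in-memory cache first
    if value is not None:
        return value
    value = fetch_from_cassandra(key)    # 2. fall back to Cassandra on a miss
    if value is not None:
        mc.set(key, value, time=CACHE_TTL_SECONDS)  # 3. populate the cache
    return value

On a warm cache this keeps most reads off the sstables entirely, which is
presumably what makes >10k ops/sec at millisecond latency plausible on the
same hardware.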

On Wed, May 5, 2010 at 9:43 AM, Jordan Pittier <jo...@gmail.com>wrote:

> I disagree. Swapping could be avoided. I don't know Cassandra internals
> mechanisms but what I am expecting is that whenever I want to read rows that
> are not in RAM, Cassandra load them from hard drive to RAM if space is
> available, and, if RAM is full to reply my query without saving rows in RAM.
> No need for swapping.
>
> I have no try yet to change DiskAccessMode to standard, I hope it will help
> me.
>
> Another thing : please dont post your benchmark figures without any
> explanation on the work load generator or your cluster settings. It really
> doesn't make any sense...
>
>
> On Wed, May 5, 2010 at 6:16 PM, Weijun Li <we...@gmail.com> wrote:
>
>> When you have much more data than you can hold in memory, it will be
>> difficult for you to get around of swap which will most likely ruin your
>> performance. Also in this case mmap doesn't seem to make much sense if you
>> use random partitioner which will end up with crazy swap too. However we
>> found a way to get around read/write performance issue by integrating
>> memcached into Cassandra: in this case you need to ask memcached to disable
>> disk swap so you can achieve move than 10k read+write with milli-second
>> level of latency. Actually this is the only way that we figured out that can
>> gracefully solve the performance and memory issue.
>>
>> -Weijun
>>
>>
>> On Wed, May 5, 2010 at 8:19 AM, Ran Tavory <ra...@gmail.com> wrote:
>>
>>> I'm still trying to figure out where my slowness is coming from...
>>> By now I'm pretty sure it's the reads are slow, but not sure how to
>>> improve them.
>>>
>>> I'm looking at cfstats. Can you say if there are better configuration
>>> options? So far I've used all default settings, except for:
>>>
>>>     <Keyspace Name="outbrain_kvdb">
>>>       <ColumnFamily CompareWith="BytesType" Name="KvImpressions"
>>> KeysCached="50%"/>
>>>
>>>  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackAwareStrategy</ReplicaPlacementStrategy>
>>>       <ReplicationFactor>2</ReplicationFactor>
>>>
>>>  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
>>>     </Keyspace>
>>>
>>>
>>> What does a good read latency look like? I was expecting 10ms, however so
>>> far it seems that my KvImpressions read latency is 30ms and in the system
>>> keyspace I have 800ms :(
>>> I thought adding KeysCached="50%" would improve my situation but
>>> unfortunately looks like the hitrate is about 0. I realize that's
>>> application specific, but maybe there are other magic bullets...
>>>
>>> Is there something like adding cache to the system keyspace? 800 ms is
>>> pretty bad, isn't it?
>>>
>>> See stats below and thanks.
>>>
>>>
>>> Keyspace: outbrain_kvdb
>>>         Read Count: 651668
>>>         Read Latency: 34.18622328547666 ms.
>>>         Write Count: 655542
>>>         Write Latency: 0.041145092152752985 ms.
>>>         Pending Tasks: 0
>>>                 Column Family: KvImpressions
>>>                 SSTable count: 13
>>>                 Space used (live): 23304548897
>>>                 Space used (total): 23304548897
>>>                 Memtable Columns Count: 895
>>>                 Memtable Data Size: 2108990
>>>                 Memtable Switch Count: 8
>>>                 Read Count: 468083
>>>                 Read Latency: 151.603 ms.
>>>                 Write Count: 552566
>>>                 Write Latency: 0.023 ms.
>>>                 Pending Tasks: 0
>>>                 Key cache capacity: 17398656
>>>                 Key cache size: 567967
>>>                 Key cache hit rate: 0.0
>>>                 Row cache: disabled
>>>                 Compacted row minimum size: 269
>>>                 Compacted row maximum size: 54501
>>>                 Compacted row mean size: 933
>>> ...
>>> ----------------
>>> Keyspace: system
>>>         Read Count: 1151
>>>         Read Latency: 872.5014448305822 ms.
>>>         Write Count: 51215
>>>         Write Latency: 0.07156788050375866 ms.
>>>         Pending Tasks: 0
>>>                 Column Family: HintsColumnFamily
>>>                 SSTable count: 5
>>>                 Space used (live): 437366878
>>>                 Space used (total): 437366878
>>>                 Memtable Columns Count: 14987
>>>                 Memtable Data Size: 87975
>>>                 Memtable Switch Count: 2
>>>                 Read Count: 1150
>>>                 Read Latency: NaN ms.
>>>                 Write Count: 51211
>>>                 Write Latency: 0.027 ms.
>>>                 Pending Tasks: 0
>>>                 Key cache capacity: 6
>>>                 Key cache size: 4
>>>                 Key cache hit rate: NaN
>>>                 Row cache: disabled
>>>                 Compacted row minimum size: 0
>>>                 Compacted row maximum size: 0
>>>                 Compacted row mean size: 0
>>>
>>>                 Column Family: LocationInfo
>>>                 SSTable count: 2
>>>                 Space used (live): 3504
>>>                 Space used (total): 3504
>>>                 Memtable Columns Count: 0
>>>                 Memtable Data Size: 0
>>>                 Memtable Switch Count: 1
>>>                 Read Count: 1
>>>                 Read Latency: NaN ms.
>>>                 Write Count: 7
>>>                 Write Latency: NaN ms.
>>>                 Pending Tasks: 0
>>>                 Key cache capacity: 2
>>>                 Key cache size: 1
>>>                 Key cache hit rate: NaN
>>>                 Row cache: disabled
>>>                 Compacted row minimum size: 0
>>>                 Compacted row maximum size: 0
>>>                 Compacted row mean size: 0
>>>
>>>
>>> On Tue, May 4, 2010 at 10:57 PM, Kyusik Chung <ky...@discovereads.com>wrote:
>>>
>>>> Im using Ubuntu 8.04 on 64 bit hosts on rackspace cloud.
>>>>
>>>> Im in the middle of repeating some perf tests, but so far, I get as-good
>>>> or slightly better read perf by using standard disk access mode vs mmap.  So
>>>> far consecutive tests are returning consistent numbers.
>>>>
>>>> Im not sure how to explain it...maybe its an ubuntu 8.04 issue with
>>>> mmap.  Back when I was using mmap, I was definitely seeing the kswapd0
>>>> process start using cpu as the box ran out of memory, and read performance
>>>> significantly degraded.
>>>>
>>>> Next, Ill run some tests with mmap_index_only, and Ill test with heavy
>>>> concurrent writes as well as reads.  Ill let everyone know what I find.
>>>>
>>>> Kyusik Chung
>>>> CEO, Discovereads.com
>>>> kyusik@discovereads.com
>>>>
>>>> On May 4, 2010, at 12:27 PM, Jonathan Ellis wrote:
>>>>
>>>> > Are you using 32 bit hosts?  If not don't be scared of mmap using a
>>>> > lot of address space, you have plenty.  It won't make you swap more
>>>> > than using buffered i/o.
>>>> >
>>>> > On Tue, May 4, 2010 at 1:57 PM, Ran Tavory <ra...@gmail.com> wrote:
>>>> >> I canceled mmap and indeed memory usage is sane again. So far
>>>> performance
>>>> >> hasn't been great, but I'll wait and see.
>>>> >> I'm also interested in a way to cap mmap so I can take advantage of
>>>> it but
>>>> >> not swap the host to death...
>>>> >>
>>>> >> On Tue, May 4, 2010 at 9:38 PM, Kyusik Chung <
>>>> kyusik@discovereads.com>
>>>> >> wrote:
>>>> >>>
>>>> >>> This sounds just like the slowness I was asking about in another
>>>> thread -
>>>> >>> after a lot of reads, the machine uses up all available memory on
>>>> the box
>>>> >>> and then starts swapping.
>>>> >>> My understanding was that mmap helps greatly with read and write
>>>> perf
>>>> >>> (until the box starts swapping I guess)...is there any way to use
>>>> mmap and
>>>> >>> cap how much memory it takes up?
>>>> >>> What do people use in production?  mmap or no mmap?
>>>> >>> Thanks!
>>>> >>> Kyusik Chung
>>>> >>> On May 4, 2010, at 10:11 AM, Schubert Zhang wrote:
>>>> >>>
>>>> >>> 1. When initially startup your nodes, please plan your InitialToken
>>>> of
>>>> >>> each node evenly.
>>>> >>> 2. <DiskAccessMode>standard</DiskAccessMode>
>>>> >>>
>>>> >>> On Tue, May 4, 2010 at 9:09 PM, Boris Shulman <sh...@gmail.com>
>>>> wrote:
>>>> >>>>
>>>> >>>> I think that the extra (more than 4GB) memory usage comes from the
>>>> >>>> mmaped io, that is why it happens only for reads.
>>>> >>>>
>>>> >>>> On Tue, May 4, 2010 at 2:02 PM, Jordan Pittier <
>>>> jordan.pittier@gmail.com>
>>>> >>>> wrote:
>>>> >>>>> I'm facing the same issue with swap. It only occurs when I perform
>>>> read
>>>> >>>>> operations (write are very fast :)). So I can't help you with the
>>>> >>>>> memory
>>>> >>>>> probleme.
>>>> >>>>>
>>>> >>>>> But to balance the load evenly between nodes in cluster just
>>>> manually
>>>> >>>>> fix
>>>> >>>>> their token.(the "formula" is i * 2^127 / nb_nodes).
>>>> >>>>>
>>>> >>>>> Jordzn
>>>> >>>>>
>>>> >>>>> On Tue, May 4, 2010 at 8:20 AM, Ran Tavory <ra...@gmail.com>
>>>> wrote:
>>>> >>>>>>
>>>> >>>>>> I'm looking into performance issues on a 0.6.1 cluster. I see two
>>>> >>>>>> symptoms:
>>>> >>>>>> 1. Reads and writes are slow
>>>> >>>>>> 2. One of the hosts is doing a lot of GC.
>>>> >>>>>> 1 is slow in the sense that in normal state the cluster used to
>>>> make
>>>> >>>>>> around 3-5k read and writes per second (6-10k operations per
>>>> second),
>>>> >>>>>> but
>>>> >>>>>> how it's in the order of 200-400 ops per second, sometimes even
>>>> less.
>>>> >>>>>> 2 looks like this:
>>>> >>>>>> $ tail -f /outbrain/cassandra/log/system.log
>>>> >>>>>>  INFO [GC inspection] 2010-05-04 00:42:18,636 GCInspector.java
>>>> (line
>>>> >>>>>> 110)
>>>> >>>>>> GC for ParNew: 672 ms, 166482384 reclaimed leaving 2872087208
>>>> used;
>>>> >>>>>> max is
>>>> >>>>>> 4432068608
>>>> >>>>>>  INFO [GC inspection] 2010-05-04 00:42:28,638 GCInspector.java
>>>> (line
>>>> >>>>>> 110)
>>>> >>>>>> GC for ParNew: 498 ms, 166493352 reclaimed leaving 2836049448
>>>> used;
>>>> >>>>>> max is
>>>> >>>>>> 4432068608
>>>> >>>>>>  INFO [GC inspection] 2010-05-04 00:42:38,640 GCInspector.java
>>>> (line
>>>> >>>>>> 110)
>>>> >>>>>> GC for ParNew: 327 ms, 166091528 reclaimed leaving 2796888424
>>>> used;
>>>> >>>>>> max is
>>>> >>>>>> 4432068608
>>>> >>>>>> ... and it goes on and on for hours, no stopping...
>>>> >>>>>> The cluster is made of 6 hosts, 3 in one DC and 3 in another.
>>>> >>>>>> Each host has 8G RAM.
>>>> >>>>>> -Xmx=4G
>>>> >>>>>> For some reason, the load isn't distributed evenly b/w the hosts,
>>>> >>>>>> although
>>>> >>>>>> I'm not sure this is the cause for slowness
>>>> >>>>>> $ nodetool -h localhost -p 9004 ring
>>>> >>>>>> Address       Status     Load          Range
>>>> >>>>>>        Ring
>>>> >>>>>>
>>>> >>>>>> 144413773383729447702215082383444206680
>>>> >>>>>> 192.168.252.99Up         15.94 GB
>>>> >>>>>>  66002764663998929243644931915471302076     |<--|
>>>> >>>>>> 192.168.254.57Up         19.84 GB
>>>> >>>>>>  81288739225600737067856268063987022738     |   ^
>>>> >>>>>> 192.168.254.58Up         973.78 MB
>>>> >>>>>> 86999744104066390588161689990810839743     v   |
>>>> >>>>>> 192.168.252.62Up         5.18 GB
>>>> >>>>>> 88308919879653155454332084719458267849     |   ^
>>>> >>>>>> 192.168.254.59Up         10.57 GB
>>>> >>>>>>  142482163220375328195837946953175033937    v   |
>>>> >>>>>> 192.168.252.61Up         11.36 GB
>>>> >>>>>>  144413773383729447702215082383444206680    |-->|
>>>> >>>>>> The slow host is 192.168.252.61 and it isn't the most loaded one.
>>>> >>>>>> The host is waiting a lot on IO and the load average is usually
>>>> 6-7
>>>> >>>>>> $ w
>>>> >>>>>>  00:42:56 up 11 days, 13:22,  1 user,  load average: 6.21, 5.52,
>>>> 3.93
>>>> >>>>>> $ vmstat 5
>>>> >>>>>> procs -----------memory---------- ---swap-- -----io----
>>>> --system--
>>>> >>>>>> -----cpu------
>>>> >>>>>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs
>>>> us
>>>> >>>>>> sy id
>>>> >>>>>> wa st
>>>> >>>>>>  0  8 2147844  45744   1816 4457384    6    5    66    32    5
>>>>  2  1
>>>> >>>>>>  1
>>>> >>>>>> 96  2  0
>>>> >>>>>>  0  8 2147164  49020   1808 4451596  385    0  2345    58 3372
>>>> 9957  2
>>>> >>>>>>  2
>>>> >>>>>> 78 18  0
>>>> >>>>>>  0  3 2146432  45704   1812 4453956  342    0  2274   108 3937
>>>> 10732
>>>> >>>>>>  2  2
>>>> >>>>>> 78 19  0
>>>> >>>>>>  0  1 2146252  44696   1804 4453436  345  164  1939   294 3647
>>>> 7833  2
>>>> >>>>>>  2
>>>> >>>>>> 78 18  0
>>>> >>>>>>  0  1 2145960  46924   1744 4451260  158    0  2423   122 4354
>>>> 14597
>>>> >>>>>>  2  2
>>>> >>>>>> 77 18  0
>>>> >>>>>>  7  1 2138344  44676    952 4504148 1722  403  1722   406 1388
>>>>  439 87
>>>> >>>>>>  0
>>>> >>>>>> 10  2  0
>>>> >>>>>>  7  2 2137248  45652    956 4499436 1384  655  1384   658 1356
>>>>  392 87
>>>> >>>>>>  0
>>>> >>>>>> 10  3  0
>>>> >>>>>>  7  1 2135976  46764    956 4495020 1366  718  1366   718 1395
>>>>  380 87
>>>> >>>>>>  0
>>>> >>>>>>  9  4  0
>>>> >>>>>>  0  8 2134484  46964    956 4489420 1673  555  1814   586 1601
>>>> 215590
>>>> >>>>>> 14
>>>> >>>>>>  2 68 16  0
>>>> >>>>>>  0  1 2135388  47444    972 4488516  785  833  2390   995 3812
>>>> 8305  2
>>>> >>>>>>  2
>>>> >>>>>> 77 20  0
>>>> >>>>>>  0 10 2135164  45928    980 4488796  788  543  2275   626 36
>>>> >>>>>> So, the host is swapping like crazy...
>>>> >>>>>> top shows that it's using a lot of memory. As noted before
>>>> -Xmx=4G and
>>>> >>>>>> nothing else seems to be using a lot of memory on the host except
>>>> for
>>>> >>>>>> the
>>>> >>>>>> cassandra process, however, of the 8G ram on the host, 92% is
>>>> used by
>>>> >>>>>> cassandra. How's that?
>>>> >>>>>> Top shows there's 3.9g Shared and 7.2g Resident and 15.9g
>>>> Virtual. Why
>>>> >>>>>> does it have 15g virtual? And why 7.2 RES? This can explain the
>>>> >>>>>> slowness in
>>>> >>>>>> swapping.
>>>> >>>>>> $ top
>>>> >>>>>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+
>>>>  COMMAND
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> 20281 cassandr  25   0 15.9g 7.2g 3.9g S 33.3 92.6 175:30.27 java
>>>> >>>>>> So, can the total memory be controlled?
>>>> >>>>>> Or perhaps I'm looking in the wrong direction...
>>>> >>>>>> I've looked at all the cassandra JMX counts and nothing seemed
>>>> >>>>>> suspicious
>>>> >>>>>> so far. By suspicious i mean a large number of pending tasks -
>>>> there
>>>> >>>>>> were
>>>> >>>>>> always very small numbers in each pool.
>>>> >>>>>> About read and write latencies, I'm not sure what the normal
>>>> state is,
>>>> >>>>>> but
>>>> >>>>>> here's an example of what I see on the problematic host:
>>>> >>>>>> #mbean = org.apache.cassandra.service:type=StorageProxy:
>>>> >>>>>> RecentReadLatencyMicros = 30105.888180684495;
>>>> >>>>>> TotalReadLatencyMicros = 78543052801;
>>>> >>>>>> TotalWriteLatencyMicros = 4213118609;
>>>> >>>>>> RecentWriteLatencyMicros = 1444.4809201925639;
>>>> >>>>>> ReadOperations = 4779553;
>>>> >>>>>> RangeOperations = 0;
>>>> >>>>>> TotalRangeLatencyMicros = 0;
>>>> >>>>>> RecentRangeLatencyMicros = NaN;
>>>> >>>>>> WriteOperations = 4740093;
>>>> >>>>>> And the only pool that I do see some pending tasks is the
>>>> >>>>>> ROW-READ-STAGE,
>>>> >>>>>> but it doesn't look like much, usually around 6-8:
>>>> >>>>>> #mbean = org.apache.cassandra.concurrent:type=ROW-READ-STAGE:
>>>> >>>>>> ActiveCount = 8;
>>>> >>>>>> PendingTasks = 8;
>>>> >>>>>> CompletedTasks = 5427955;
>>>> >>>>>> Any help finding the solution is appreciated, thanks...
>>>> >>>>>> Below are a few more JMXes I collected from the system that may
>>>> be
>>>> >>>>>> interesting.
>>>> >>>>>> #mbean = java.lang:type=Memory:
>>>> >>>>>> Verbose = false;
>>>> >>>>>> HeapMemoryUsage = {
>>>> >>>>>>   committed = 3767279616;
>>>> >>>>>>   init = 134217728;
>>>> >>>>>>   max = 4293656576;
>>>> >>>>>>   used = 1237105080;
>>>> >>>>>>  };
>>>> >>>>>> NonHeapMemoryUsage = {
>>>> >>>>>>   committed = 35061760;
>>>> >>>>>>   init = 24313856;
>>>> >>>>>>   max = 138412032;
>>>> >>>>>>   used = 23151320;
>>>> >>>>>>  };
>>>> >>>>>> ObjectPendingFinalizationCount = 0;
>>>> >>>>>> #mbean = java.lang:name=ParNew,type=GarbageCollector:
>>>> >>>>>> LastGcInfo = {
>>>> >>>>>>   GcThreadCount = 11;
>>>> >>>>>>   duration = 136;
>>>> >>>>>>   endTime = 42219272;
>>>> >>>>>>   id = 11719;
>>>> >>>>>>   memoryUsageAfterGc = {
>>>> >>>>>>     ( CMS Perm Gen ) = {
>>>> >>>>>>       key = CMS Perm Gen;
>>>> >>>>>>       value = {
>>>> >>>>>>         committed = 29229056;
>>>> >>>>>>         init = 21757952;
>>>> >>>>>>         max = 88080384;
>>>> >>>>>>         used = 17648848;
>>>> >>>>>>        };
>>>> >>>>>>      };
>>>> >>>>>>     ( Code Cache ) = {
>>>> >>>>>>       key = Code Cache;
>>>> >>>>>>       value = {
>>>> >>>>>>         committed = 5832704;
>>>> >>>>>>         init = 2555904;
>>>> >>>>>>         max = 50331648;
>>>> >>>>>>         used = 5563520;
>>>> >>>>>>        };
>>>> >>>>>>      };
>>>> >>>>>>     ( CMS Old Gen ) = {
>>>> >>>>>>       key = CMS Old Gen;
>>>> >>>>>>       value = {
>>>> >>>>>>         committed = 3594133504;
>>>> >>>>>>         init = 112459776;
>>>> >>>>>>         max = 4120510464;
>>>> >>>>>>         used = 964565720;
>>>> >>>>>>        };
>>>> >>>>>>      };
>>>> >>>>>>     ( Par Eden Space ) = {
>>>> >>>>>>       key = Par Eden Space;
>>>> >>>>>>       value = {
>>>> >>>>>>         committed = 171835392;
>>>> >>>>>>         init = 21495808;
>>>> >>>>>>         max = 171835392;
>>>> >>>>>>         used = 0;
>>>> >>>>>>        };
>>>> >>>>>>      };
>>>> >>>>>>     ( Par Survivor Space ) = {
>>>> >>>>>>       key = Par Survivor Space;
>>>> >>>>>>       value = {
>>>> >>>>>>         committed = 1310720;
>>>> >>>>>>         init = 131072;
>>>> >>>>>>         max = 1310720;
>>>> >>>>>>         used = 0;
>>>> >>>>>>        };
>>>> >>>>>>      };
>>>> >>>>>>    };
>>>> >>>>>>   memoryUsageBeforeGc = {
>>>> >>>>>>     ( CMS Perm Gen ) = {
>>>> >>>>>>       key = CMS Perm Gen;
>>>> >>>>>>       value = {
>>>> >>>>>>         committed = 29229056;
>>>> >>>>>>         init = 21757952;
>>>> >>>>>>         max = 88080384;
>>>> >>>>>>         used = 17648848;
>>>> >>>>>>        };
>>>> >>>>>>      };
>>>> >>>>>>     ( Code Cache ) = {
>>>> >>>>>>       key = Code Cache;
>>>> >>>>>>       value = {
>>>> >>>>>>         committed = 5832704;
>>>> >>>>>>         init = 2555904;
>>>> >>>>>>         max = 50331648;
>>>> >>>>>>         used = 5563520;
>>>> >>>>>>        };
>>>> >>>>>>      };
>>>> >>>>>>     ( CMS Old Gen ) = {
>>>> >>>>>>       key = CMS Old Gen;
>>>> >>>>>>       value = {
>>>> >>>>>>         committed = 3594133504;
>>>> >>>>>>         init = 112459776;
>>>> >>>>>>         max = 4120510464;
>>>> >>>>>>         used = 959221872;
>>>> >>>>>>        };
>>>> >>>>>>      };
>>>> >>>>>>     ( Par Eden Space ) = {
>>>> >>>>>>       key = Par Eden Space;
>>>> >>>>>>       value = {
>>>> >>>>>>         committed = 171835392;
>>>> >>>>>>         init = 21495808;
>>>> >>>>>>         max = 171835392;
>>>> >>>>>>         used = 171835392;
>>>> >>>>>>        };
>>>> >>>>>>      };
>>>> >>>>>>     ( Par Survivor Space ) = {
>>>> >>>>>>       key = Par Survivor Space;
>>>> >>>>>>       value = {
>>>> >>>>>>         committed = 1310720;
>>>> >>>>>>         init = 131072;
>>>> >>>>>>         max = 1310720;
>>>> >>>>>>         used = 0;
>>>> >>>>>>        };
>>>> >>>>>>      };
>>>> >>>>>>    };
>>>> >>>>>>   startTime = 42219136;
>>>> >>>>>>  };
>>>> >>>>>> CollectionCount = 11720;
>>>> >>>>>> CollectionTime = 4561730;
>>>> >>>>>> Name = ParNew;
>>>> >>>>>> Valid = true;
>>>> >>>>>> MemoryPoolNames = [ Par Eden Space, Par Survivor Space ];
>>>> >>>>>> #mbean = java.lang:type=OperatingSystem:
>>>> >>>>>> MaxFileDescriptorCount = 63536;
>>>> >>>>>> OpenFileDescriptorCount = 75;
>>>> >>>>>> CommittedVirtualMemorySize = 17787711488;
>>>> >>>>>> FreePhysicalMemorySize = 45522944;
>>>> >>>>>> FreeSwapSpaceSize = 2123968512;
>>>> >>>>>> ProcessCpuTime = 12251460000000;
>>>> >>>>>> TotalPhysicalMemorySize = 8364417024;
>>>> >>>>>> TotalSwapSpaceSize = 4294959104;
>>>> >>>>>> Name = Linux;
>>>> >>>>>> AvailableProcessors = 8;
>>>> >>>>>> Arch = amd64;
>>>> >>>>>> SystemLoadAverage = 4.36;
>>>> >>>>>> Version = 2.6.18-164.15.1.el5;
>>>> >>>>>> #mbean = java.lang:type=Runtime:
>>>> >>>>>> Name = 20281@ob1061.nydc1.outbrain.com;
>>>> >>>>>>
>>>> >>>>>> ClassPath =
>>>> >>>>>>
>>>> >>>>>>
>>>> /outbrain/cassandra/apache-cassandra-0.6.1/bin/../conf:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../build/classes:/outbrain/cassandra/apache-cassandra-0.6.1/bin/..
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> /lib/antlr-3.1.3.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/apache-cassandra-0.6.1.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/avro-1.2.0-dev.jar:/outb
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> rain/cassandra/apache-cassandra-0.6.1/bin/../lib/clhm-production.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/commons-cli-1.1.jar:/outbrain/cassandra/apache-cassandra-
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> 0.6.1/bin/../lib/commons-codec-1.2.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/commons-collections-3.2.1.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/com
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> mons-lang-2.4.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/google-collections-1.0.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/hadoop-core-0.20.1.jar:/out
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> brain/cassandra/apache-cassandra-0.6.1/bin/../lib/high-scale-lib.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/ivy-2.1.0.jar:/outbrain/cassandra/apache-cassandra-0.6.1/
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> bin/../lib/jackson-core-asl-1.4.0.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/jackson-mapper-asl-1.4.0.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/jline
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> -0.9.94.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/json-simple-1.1.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/libthrift-r917130.jar:/outbrain/cassandr
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> a/apache-cassandra-0.6.1/bin/../lib/log4j-1.2.14.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib/slf4j-api-1.5.8.jar:/outbrain/cassandra/apache-cassandra-0.6.1/bin/../lib
>>>> >>>>>> /slf4j-log4j12-1.5.8.jar;
>>>> >>>>>>
>>>> >>>>>> BootClassPath =
>>>> >>>>>>
>>>> >>>>>>
>>>> /usr/java/jdk1.6.0_17/jre/lib/alt-rt.jar:/usr/java/jdk1.6.0_17/jre/lib/resources.jar:/usr/java/jdk1.6.0_17/jre/lib/rt.jar:/usr/java/jdk1.6.0_17/jre/lib/sunrsasign.j
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> ar:/usr/java/jdk1.6.0_17/jre/lib/jsse.jar:/usr/java/jdk1.6.0_17/jre/lib/jce.jar:/usr/java/jdk1.6.0_17/jre/lib/charsets.jar:/usr/java/jdk1.6.0_17/jre/classes;
>>>> >>>>>>
>>>> >>>>>> LibraryPath =
>>>> >>>>>>
>>>> >>>>>>
>>>> /usr/java/jdk1.6.0_17/jre/lib/amd64/server:/usr/java/jdk1.6.0_17/jre/lib/amd64:/usr/java/jdk1.6.0_17/jre/../lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib;
>>>> >>>>>>
>>>> >>>>>> VmName = Java HotSpot(TM) 64-Bit Server VM;
>>>> >>>>>>
>>>> >>>>>> VmVendor = Sun Microsystems Inc.;
>>>> >>>>>>
>>>> >>>>>> VmVersion = 14.3-b01;
>>>> >>>>>>
>>>> >>>>>> BootClassPathSupported = true;
>>>> >>>>>>
>>>> >>>>>> InputArguments = [ -ea, -Xms128M, -Xmx4G,
>>>> -XX:TargetSurvivorRatio=90,
>>>> >>>>>> -XX:+AggressiveOpts, -XX:+UseParNewGC, -XX:+UseConcMarkSweepGC,
>>>> >>>>>> -XX:+CMSParallelRemarkEnabled, -XX:+HeapDumpOnOutOfMemoryError,
>>>> >>>>>> -XX:SurvivorRatio=128, -XX:MaxTenuringThreshold=0,
>>>> >>>>>> -Dcom.sun.management.jmxremote.port=9004,
>>>> >>>>>> -Dcom.sun.management.jmxremote.ssl=false,
>>>> >>>>>> -Dcom.sun.management.jmxremote.authenticate=false,
>>>> >>>>>>
>>>> >>>>>>
>>>> -Dstorage-config=/outbrain/cassandra/apache-cassandra-0.6.1/bin/../conf,
>>>> >>>>>> -Dcassandra-pidfile=/var/run/cassandra.pid ];
>>>> >>>>>>
>>>> >>>>>> ManagementSpecVersion = 1.2;
>>>> >>>>>>
>>>> >>>>>> SpecName = Java Virtual Machine Specification;
>>>> >>>>>>
>>>> >>>>>> SpecVendor = Sun Microsystems Inc.;
>>>> >>>>>>
>>>> >>>>>> SpecVersion = 1.0;
>>>> >>>>>>
>>>> >>>>>> StartTime = 1272911001415;
>>>> >>>>>> ...
>>>> >>>>>
>>>> >>>
>>>> >>>
>>>> >>
>>>> >>
>>>> >
>>>> >
>>>> >
>>>> > --
>>>> > Jonathan Ellis
>>>> > Project Chair, Apache Cassandra
>>>> > co-founder of Riptano, the source for professional Cassandra support
>>>> > http://riptano.com
>>>>
>>>>
>>>
>>
>

Re: performance tuning - where does the slowness come from?

Posted by Jordan Pittier <jo...@gmail.com>.
I disagree: swapping can be avoided. I don't know Cassandra's internal
mechanisms, but what I expect is that whenever I read rows that are not in
RAM, Cassandra loads them from the hard drive into RAM if space is
available, and if RAM is full it answers my query without keeping the rows
in RAM. No need for swapping.
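
In code, that expected behaviour is roughly the following toy sketch (an
illustration of the behaviour being asked for only, not how Cassandra 0.6
actually reads rows; a real cache would also evict old entries rather than
just stop filling):

class BoundedRowCache:
    """Toy model: rows are kept in RAM only while there is room; once the
    budget is spent, reads are served straight from disk instead of letting
    the OS start swapping."""

    def __init__(self, max_rows):
        self.max_rows = max_rows
        self._rows = {}

    def get(self, key, load_from_disk):
        if key in self._rows:                 # already in RAM
            return self._rows[key]
        row = load_from_disk(key)             # miss: read from the sstables
        if len(self._rows) < self.max_rows:   # cache only if space is available
            self._rows[key] = row
        return row                            # cache full: answer, don't store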

I have not yet tried changing DiskAccessMode to standard; I hope it will
help.

Another thing: please don't post benchmark figures without explaining the
workload generator or your cluster settings. Without that context they
really don't mean much...

On Wed, May 5, 2010 at 6:16 PM, Weijun Li <we...@gmail.com> wrote:

> When you have much more data than you can hold in memory, it will be
> difficult for you to get around of swap which will most likely ruin your
> performance. Also in this case mmap doesn't seem to make much sense if you
> use random partitioner which will end up with crazy swap too. However we
> found a way to get around read/write performance issue by integrating
> memcached into Cassandra: in this case you need to ask memcached to disable
> disk swap so you can achieve move than 10k read+write with milli-second
> level of latency. Actually this is the only way that we figured out that can
> gracefully solve the performance and memory issue.
>
> -Weijun
>
>
> On Wed, May 5, 2010 at 8:19 AM, Ran Tavory <ra...@gmail.com> wrote:
>
>> I'm still trying to figure out where my slowness is coming from...
>> By now I'm pretty sure it's the reads are slow, but not sure how to
>> improve them.
>>
>> I'm looking at cfstats. Can you say if there are better configuration
>> options? So far I've used all default settings, except for:
>>
>>     <Keyspace Name="outbrain_kvdb">
>>       <ColumnFamily CompareWith="BytesType" Name="KvImpressions"
>> KeysCached="50%"/>
>>
>>  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackAwareStrategy</ReplicaPlacementStrategy>
>>       <ReplicationFactor>2</ReplicationFactor>
>>
>>  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
>>     </Keyspace>
>>
>>
>> What does a good read latency look like? I was expecting 10ms, however so
>> far it seems that my KvImpressions read latency is 30ms and in the system
>> keyspace I have 800ms :(
>> I thought adding KeysCached="50%" would improve my situation but
>> unfortunately looks like the hitrate is about 0. I realize that's
>> application specific, but maybe there are other magic bullets...
>>
>> Is there something like adding cache to the system keyspace? 800 ms is
>> pretty bad, isn't it?
>>
>> See stats below and thanks.
>>
>>
>> Keyspace: outbrain_kvdb
>>         Read Count: 651668
>>         Read Latency: 34.18622328547666 ms.
>>         Write Count: 655542
>>         Write Latency: 0.041145092152752985 ms.
>>         Pending Tasks: 0
>>                 Column Family: KvImpressions
>>                 SSTable count: 13
>>                 Space used (live): 23304548897
>>                 Space used (total): 23304548897
>>                 Memtable Columns Count: 895
>>                 Memtable Data Size: 2108990
>>                 Memtable Switch Count: 8
>>                 Read Count: 468083
>>                 Read Latency: 151.603 ms.
>>                 Write Count: 552566
>>                 Write Latency: 0.023 ms.
>>                 Pending Tasks: 0
>>                 Key cache capacity: 17398656
>>                 Key cache size: 567967
>>                 Key cache hit rate: 0.0
>>                 Row cache: disabled
>>                 Compacted row minimum size: 269
>>                 Compacted row maximum size: 54501
>>                 Compacted row mean size: 933
>> ...
>> ----------------
>> Keyspace: system
>>         Read Count: 1151
>>         Read Latency: 872.5014448305822 ms.
>>         Write Count: 51215
>>         Write Latency: 0.07156788050375866 ms.
>>         Pending Tasks: 0
>>                 Column Family: HintsColumnFamily
>>                 SSTable count: 5
>>                 Space used (live): 437366878
>>                 Space used (total): 437366878
>>                 Memtable Columns Count: 14987
>>                 Memtable Data Size: 87975
>>                 Memtable Switch Count: 2
>>                 Read Count: 1150
>>                 Read Latency: NaN ms.
>>                 Write Count: 51211
>>                 Write Latency: 0.027 ms.
>>                 Pending Tasks: 0
>>                 Key cache capacity: 6
>>                 Key cache size: 4
>>                 Key cache hit rate: NaN
>>                 Row cache: disabled
>>                 Compacted row minimum size: 0
>>                 Compacted row maximum size: 0
>>                 Compacted row mean size: 0
>>
>>                 Column Family: LocationInfo
>>                 SSTable count: 2
>>                 Space used (live): 3504
>>                 Space used (total): 3504
>>                 Memtable Columns Count: 0
>>                 Memtable Data Size: 0
>>                 Memtable Switch Count: 1
>>                 Read Count: 1
>>                 Read Latency: NaN ms.
>>                 Write Count: 7
>>                 Write Latency: NaN ms.
>>                 Pending Tasks: 0
>>                 Key cache capacity: 2
>>                 Key cache size: 1
>>                 Key cache hit rate: NaN
>>                 Row cache: disabled
>>                 Compacted row minimum size: 0
>>                 Compacted row maximum size: 0
>>                 Compacted row mean size: 0
>>
>>
>> On Tue, May 4, 2010 at 10:57 PM, Kyusik Chung <ky...@discovereads.com>wrote:
>>
>>> Im using Ubuntu 8.04 on 64 bit hosts on rackspace cloud.
>>>
>>> Im in the middle of repeating some perf tests, but so far, I get as-good
>>> or slightly better read perf by using standard disk access mode vs mmap.  So
>>> far consecutive tests are returning consistent numbers.
>>>
>>> Im not sure how to explain it...maybe its an ubuntu 8.04 issue with mmap.
>>>  Back when I was using mmap, I was definitely seeing the kswapd0 process
>>> start using cpu as the box ran out of memory, and read performance
>>> significantly degraded.
>>>
>>> Next, Ill run some tests with mmap_index_only, and Ill test with heavy
>>> concurrent writes as well as reads.  Ill let everyone know what I find.
>>>
>>> Kyusik Chung
>>> CEO, Discovereads.com
>>> kyusik@discovereads.com
>>>
>>> On May 4, 2010, at 12:27 PM, Jonathan Ellis wrote:
>>>
>>> > Are you using 32 bit hosts?  If not don't be scared of mmap using a
>>> > lot of address space, you have plenty.  It won't make you swap more
>>> > than using buffered i/o.
>>> >
>>> > On Tue, May 4, 2010 at 1:57 PM, Ran Tavory <ra...@gmail.com> wrote:
>>> >> I canceled mmap and indeed memory usage is sane again. So far
>>> performance
>>> >> hasn't been great, but I'll wait and see.
>>> >> I'm also interested in a way to cap mmap so I can take advantage of it
>>> but
>>> >> not swap the host to death...
>>> >>
>>> >> On Tue, May 4, 2010 at 9:38 PM, Kyusik Chung <kyusik@discovereads.com
>>> >
>>> >> wrote:
>>> >>>
>>> >>> This sounds just like the slowness I was asking about in another
>>> thread -
>>> >>> after a lot of reads, the machine uses up all available memory on the
>>> box
>>> >>> and then starts swapping.
>>> >>> My understanding was that mmap helps greatly with read and write perf
>>> >>> (until the box starts swapping I guess)...is there any way to use
>>> mmap and
>>> >>> cap how much memory it takes up?
>>> >>> What do people use in production?  mmap or no mmap?
>>> >>> Thanks!
>>> >>> Kyusik Chung

Re: performance tuning - where does the slowness come from?

Posted by Weijun Li <we...@gmail.com>.
When you have much more data than you can hold in memory, it is hard to
avoid swapping, and swapping will most likely ruin your performance. mmap
doesn't make much sense in that case either: with the random partitioner
the access pattern is essentially random, so the pages you touch keep
getting evicted and you end up swapping heavily anyway. However, we found
a way around the read/write performance problem by putting memcached in
front of Cassandra. In that setup you need to make sure memcached itself
never swaps to disk; with that we get more than 10k combined reads+writes
per second at millisecond-level latency. So far this is the only approach
we've found that gracefully solves both the performance and the memory
issue.
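
To make it concrete, the read path is just the usual cache-aside pattern,
roughly like the sketch below (MemcacheClient and CassandraStore are
placeholder interfaces standing in for whatever client libraries you
already use; this is illustrative, not our production code):

// Cache-aside sketch (illustrative only): memcached first, Cassandra on a
// miss, then repopulate the cache. The interfaces are placeholders, not
// real client APIs.
public final class CachedKvStore {

    public interface MemcacheClient {
        byte[] get(String key);
        void set(String key, byte[] value, int ttlSeconds);
    }

    public interface CassandraStore {
        byte[] read(String key);
        void write(String key, byte[] value);
    }

    private final MemcacheClient cache;
    private final CassandraStore store;
    private final int ttlSeconds;

    public CachedKvStore(MemcacheClient cache, CassandraStore store, int ttlSeconds) {
        this.cache = cache;
        this.store = store;
        this.ttlSeconds = ttlSeconds;
    }

    public byte[] get(String key) {
        byte[] value = cache.get(key);      // hot keys come straight from RAM
        if (value == null) {
            value = store.read(key);        // cache miss: go to Cassandra
            if (value != null) {
                cache.set(key, value, ttlSeconds);
            }
        }
        return value;
    }

    public void put(String key, byte[] value) {
        store.write(key, value);            // Cassandra stays the source of truth
        cache.set(key, value, ttlSeconds);  // keep the cache warm
    }
}

The point is simply that hot keys are served from memcached's RAM, so
Cassandra only sees the misses.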

-Weijun

On Wed, May 5, 2010 at 8:19 AM, Ran Tavory <ra...@gmail.com> wrote:

> I'm still trying to figure out where my slowness is coming from...
> By now I'm pretty sure it's the reads are slow, but not sure how to improve
> them.
>
> I'm looking at cfstats. Can you say if there are better configuration
> options? So far I've used all default settings, except for:
>
>     <Keyspace Name="outbrain_kvdb">
>       <ColumnFamily CompareWith="BytesType" Name="KvImpressions"
> KeysCached="50%"/>
>
>  <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackAwareStrategy</ReplicaPlacementStrategy>
>       <ReplicationFactor>2</ReplicationFactor>
>
>  <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
>     </Keyspace>
>
>
> What does a good read latency look like? I was expecting 10ms, however so
> far it seems that my KvImpressions read latency is 30ms and in the system
> keyspace I have 800ms :(
> I thought adding KeysCached="50%" would improve my situation but
> unfortunately looks like the hitrate is about 0. I realize that's
> application specific, but maybe there are other magic bullets...
>
> Is there something like adding cache to the system keyspace? 800 ms is
> pretty bad, isn't it?
>
> See stats below and thanks.
>
>
> Keyspace: outbrain_kvdb
>         Read Count: 651668
>         Read Latency: 34.18622328547666 ms.
>         Write Count: 655542
>         Write Latency: 0.041145092152752985 ms.
>         Pending Tasks: 0
>                 Column Family: KvImpressions
>                 SSTable count: 13
>                 Space used (live): 23304548897
>                 Space used (total): 23304548897
>                 Memtable Columns Count: 895
>                 Memtable Data Size: 2108990
>                 Memtable Switch Count: 8
>                 Read Count: 468083
>                 Read Latency: 151.603 ms.
>                 Write Count: 552566
>                 Write Latency: 0.023 ms.
>                 Pending Tasks: 0
>                 Key cache capacity: 17398656
>                 Key cache size: 567967
>                 Key cache hit rate: 0.0
>                 Row cache: disabled
>                 Compacted row minimum size: 269
>                 Compacted row maximum size: 54501
>                 Compacted row mean size: 933
> ...
> ----------------
> Keyspace: system
>         Read Count: 1151
>         Read Latency: 872.5014448305822 ms.
>         Write Count: 51215
>         Write Latency: 0.07156788050375866 ms.
>         Pending Tasks: 0
>                 Column Family: HintsColumnFamily
>                 SSTable count: 5
>                 Space used (live): 437366878
>                 Space used (total): 437366878
>                 Memtable Columns Count: 14987
>                 Memtable Data Size: 87975
>                 Memtable Switch Count: 2
>                 Read Count: 1150
>                 Read Latency: NaN ms.
>                 Write Count: 51211
>                 Write Latency: 0.027 ms.
>                 Pending Tasks: 0
>                 Key cache capacity: 6
>                 Key cache size: 4
>                 Key cache hit rate: NaN
>                 Row cache: disabled
>                 Compacted row minimum size: 0
>                 Compacted row maximum size: 0
>                 Compacted row mean size: 0
>
>                 Column Family: LocationInfo
>                 SSTable count: 2
>                 Space used (live): 3504
>                 Space used (total): 3504
>                 Memtable Columns Count: 0
>                 Memtable Data Size: 0
>                 Memtable Switch Count: 1
>                 Read Count: 1
>                 Read Latency: NaN ms.
>                 Write Count: 7
>                 Write Latency: NaN ms.
>                 Pending Tasks: 0
>                 Key cache capacity: 2
>                 Key cache size: 1
>                 Key cache hit rate: NaN
>                 Row cache: disabled
>                 Compacted row minimum size: 0
>                 Compacted row maximum size: 0
>                 Compacted row mean size: 0
>
>

Re: performance tuning - where does the slowness come from?

Posted by Ran Tavory <ra...@gmail.com>.
I'm still trying to figure out where my slowness is coming from...
By now I'm pretty sure it's the reads that are slow, but I'm not sure how
to improve them.

I'm looking at cfstats. Can you say whether there are better configuration
options? So far I've used all default settings, except for:

    <Keyspace Name="outbrain_kvdb">
      <ColumnFamily CompareWith="BytesType" Name="KvImpressions"
KeysCached="50%"/>

 <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackAwareStrategy</ReplicaPlacementStrategy>
      <ReplicationFactor>2</ReplicationFactor>

 <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
    </Keyspace>


What does a good read latency look like? I was expecting around 10ms, but
so far my KvImpressions read latency is 30ms, and in the system keyspace I
see 800ms :(
I thought adding KeysCached="50%" would improve my situation, but
unfortunately the key cache hit rate looks like it's about 0. I realize
that's application specific, but maybe there are other magic bullets...
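
For example, would row caching help here? If I'm reading the 0.6
storage-conf.xml syntax right, it would be something like the following
(the 10000 is a number I made up and haven't tested):

      <ColumnFamily CompareWith="BytesType" Name="KvImpressions"
                    KeysCached="50%" RowsCached="10000"/>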

Is there something like adding cache to the system keyspace? 800 ms is
pretty bad, isn't it?

See stats below and thanks.


Keyspace: outbrain_kvdb
        Read Count: 651668
        Read Latency: 34.18622328547666 ms.
        Write Count: 655542
        Write Latency: 0.041145092152752985 ms.
        Pending Tasks: 0
                Column Family: KvImpressions
                SSTable count: 13
                Space used (live): 23304548897
                Space used (total): 23304548897
                Memtable Columns Count: 895
                Memtable Data Size: 2108990
                Memtable Switch Count: 8
                Read Count: 468083
                Read Latency: 151.603 ms.
                Write Count: 552566
                Write Latency: 0.023 ms.
                Pending Tasks: 0
                Key cache capacity: 17398656
                Key cache size: 567967
                Key cache hit rate: 0.0
                Row cache: disabled
                Compacted row minimum size: 269
                Compacted row maximum size: 54501
                Compacted row mean size: 933
...
----------------
Keyspace: system
        Read Count: 1151
        Read Latency: 872.5014448305822 ms.
        Write Count: 51215
        Write Latency: 0.07156788050375866 ms.
        Pending Tasks: 0
                Column Family: HintsColumnFamily
                SSTable count: 5
                Space used (live): 437366878
                Space used (total): 437366878
                Memtable Columns Count: 14987
                Memtable Data Size: 87975
                Memtable Switch Count: 2
                Read Count: 1150
                Read Latency: NaN ms.
                Write Count: 51211
                Write Latency: 0.027 ms.
                Pending Tasks: 0
                Key cache capacity: 6
                Key cache size: 4
                Key cache hit rate: NaN
                Row cache: disabled
                Compacted row minimum size: 0
                Compacted row maximum size: 0
                Compacted row mean size: 0

                Column Family: LocationInfo
                SSTable count: 2
                Space used (live): 3504
                Space used (total): 3504
                Memtable Columns Count: 0
                Memtable Data Size: 0
                Memtable Switch Count: 1
                Read Count: 1
                Read Latency: NaN ms.
                Write Count: 7
                Write Latency: NaN ms.
                Pending Tasks: 0
                Key cache capacity: 2
                Key cache size: 1
                Key cache hit rate: NaN
                Row cache: disabled
                Compacted row minimum size: 0
                Compacted row maximum size: 0
                Compacted row mean size: 0


On Tue, May 4, 2010 at 10:57 PM, Kyusik Chung <ky...@discovereads.com>wrote:

> Im using Ubuntu 8.04 on 64 bit hosts on rackspace cloud.
>
> Im in the middle of repeating some perf tests, but so far, I get as-good or
> slightly better read perf by using standard disk access mode vs mmap.  So
> far consecutive tests are returning consistent numbers.
>
> Im not sure how to explain it...maybe its an ubuntu 8.04 issue with mmap.
>  Back when I was using mmap, I was definitely seeing the kswapd0 process
> start using cpu as the box ran out of memory, and read performance
> significantly degraded.
>
> Next, Ill run some tests with mmap_index_only, and Ill test with heavy
> concurrent writes as well as reads.  Ill let everyone know what I find.
>
> Kyusik Chung
> CEO, Discovereads.com
> kyusik@discovereads.com
>
> On May 4, 2010, at 12:27 PM, Jonathan Ellis wrote:
>
> > Are you using 32 bit hosts?  If not don't be scared of mmap using a
> > lot of address space, you have plenty.  It won't make you swap more
> > than using buffered i/o.
> >

Re: performance tuning - where does the slowness come from?

Posted by Kyusik Chung <ky...@discovereads.com>.
Im using Ubuntu 8.04 on 64 bit hosts on rackspace cloud.

Im in the middle of repeating some perf tests, but so far, I get as-good or slightly better read perf by using standard disk access mode vs mmap.  So far consecutive tests are returning consistent numbers.

I'm not sure how to explain it... maybe it's an Ubuntu 8.04 issue with mmap. Back when I was using mmap, I was definitely seeing the kswapd0 process start using CPU as the box ran out of memory, and read performance degraded significantly.

Next, I'll run some tests with mmap_index_only, and I'll test with heavy concurrent writes as well as reads. I'll let everyone know what I find.
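
In case it helps anyone reproduce the comparison: as far as I know the switch is just
the DiskAccessMode element in storage-conf.xml (the same element that has come up in
this thread), one value per run. My understanding is that the default, auto, ends up
choosing mmap on a 64-bit JVM, so the other modes have to be set explicitly:

  <DiskAccessMode>standard</DiskAccessMode>         <!-- buffered i/o only, no mmap -->
  <DiskAccessMode>mmap_index_only</DiskAccessMode>  <!-- mmap only the index files -->
  <DiskAccessMode>mmap</DiskAccessMode>             <!-- mmap both data and index files -->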

Kyusik Chung
CEO, Discovereads.com
kyusik@discovereads.com

On May 4, 2010, at 12:27 PM, Jonathan Ellis wrote:

> Are you using 32 bit hosts?  If not don't be scared of mmap using a
> lot of address space, you have plenty.  It won't make you swap more
> than using buffered i/o.
> 
> On Tue, May 4, 2010 at 1:57 PM, Ran Tavory <ra...@gmail.com> wrote:
>> I canceled mmap and indeed memory usage is sane again. So far performance
>> hasn't been great, but I'll wait and see.
>> I'm also interested in a way to cap mmap so I can take advantage of it but
>> not swap the host to death...


Re: performance tuning - where does the slowness come from?

Posted by Ran Tavory <ra...@gmail.com>.
It's a 64-bit host.
When I cancel mmap I see less memory used and zero swapping, but it's slowly
growing, so I'll have to wait and see.
Performance isn't much better; I'm not sure what the bottleneck is now (it
could also be the application).

Now on the same host I see:
top - 15:43:59 up 12 days,  4:23,  1 user,  load average: 0.29, 0.68, 1.53
Tasks: 152 total,   1 running, 151 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.1%us,  0.5%sy,  0.0%ni, 97.8%id,  0.3%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   8168376k total,  8120364k used,    48012k free,     2540k buffers
Swap:  4194296k total,    12816k used,  4181480k free,  5028672k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  SWAP nFLT COMMAND
25122 cassandr  22   0 4943m 2.9g   9m S 12.6 36.7  35:39.53 2.0g  141 java


$ vmstat 5
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0  12816  46656   2664 5021340    8    6    79    34    3    1  1  1 95  3  0
 0  0  12816  48180   2672 5019460    0    0   282     9 1913 2450  2  1 97  0  0
 0  0  12816  45064   2688 5020688    0    0   282    83 1850 2303  1  1 97  0  0
 0  0  12816  47612   2696 5017520    0    0   102    59 1884 2328  1  1 98  0  0
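
Side note, in case it helps to see where the resident memory actually goes: something
like the following should show how much of RES is mmap'd SSTable data versus everything
else (the pid is the one from the top output above; grepping for Data.db is just a rough
filter for the SSTable mappings, so treat it as a sketch):

$ pmap -x 25122 | grep Data.db | awk '{rss += $3} END {print rss, "kB of SSTable data resident"}'
$ grep VmRSS /proc/25122/status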


On Tue, May 4, 2010 at 10:27 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> Are you using 32 bit hosts?  If not don't be scared of mmap using a
> lot of address space, you have plenty.  It won't make you swap more
> than using buffered i/o.

Re: performance tuning - where does the slowness come from?

Posted by Jonathan Ellis <jb...@gmail.com>.
Are you using 32-bit hosts?  If not, don't be scared of mmap using a
lot of address space; you have plenty.  It won't make you swap more
than using buffered I/O.
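
A quick way to double-check, if anyone is unsure, is something like the following;
a 64-bit setup should report x86_64 and a 64-Bit Server VM:

$ uname -m
$ java -version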

On Tue, May 4, 2010 at 1:57 PM, Ran Tavory <ra...@gmail.com> wrote:
> I canceled mmap and indeed memory usage is sane again. So far performance
> hasn't been great, but I'll wait and see.
> I'm also interested in a way to cap mmap so I can take advantage of it but
> not swap the host to death...


-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: performance tuning - where does the slowness come from?

Posted by Vick Khera <vi...@khera.org>.
On Tue, May 4, 2010 at 2:57 PM, Ran Tavory <ra...@gmail.com> wrote:
> I'm also interested in a way to cap mmap so I can take advantage of it but
> not swap the host to death...
>

Isn't the point of mmap() to just directly access a file as if it were
memory?  I can see how it would fool the reporting tools into thinking
that much memory was in use since it is part of your virtual address
space, but I am not following why it would actually *use* more memory.

Perhaps it is overwhelming the buffer cache and that is pushing other
data pages out to swap?  Just a wild-assed guess here.
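
If that guess is in the right direction, one generic Linux knob that might be worth a
look (nothing Cassandra-specific, just how eagerly the kernel trades anonymous pages
for page cache) is swappiness:

$ cat /proc/sys/vm/swappiness      # default is typically 60
$ sudo sysctl -w vm.swappiness=0   # prefer dropping page cache over swapping the heap out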

Re: performance tuning - where does the slowness come from?

Posted by Nathan McCall <na...@vervewireless.com>.
You could try mmap_index_only - this would restrict mmap usage to the
index files.
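
Concretely, assuming the 0.6 element name that has come up earlier in the thread, that
is just this line in storage-conf.xml:

  <DiskAccessMode>mmap_index_only</DiskAccessMode>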

-Nate

On Tue, May 4, 2010 at 11:57 AM, Ran Tavory <ra...@gmail.com> wrote:
> I canceled mmap and indeed memory usage is sane again. So far performance
> hasn't been great, but I'll wait and see.
> I'm also interested in a way to cap mmap so I can take advantage of it but
> not swap the host to death...
>
> On Tue, May 4, 2010 at 9:38 PM, Kyusik Chung <ky...@discovereads.com>
> wrote:
>>
>> This sounds just like the slowness I was asking about in another thread -
>> after a lot of reads, the machine uses up all available memory on the box
>> and then starts swapping.
>> My understanding was that mmap helps greatly with read and write perf
>> (until the box starts swapping I guess)...is there any way to use mmap and
>> cap how much memory it takes up?
>> What do people use in production?  mmap or no mmap?
>> Thanks!
>> Kyusik Chung
>> On May 4, 2010, at 10:11 AM, Schubert Zhang wrote:
>>
>> 1. When initially startup your nodes, please plan your InitialToken of
>> each node evenly.
>> 2. <DiskAccessMode>standard</DiskAccessMode>
>>
>> On Tue, May 4, 2010 at 9:09 PM, Boris Shulman <sh...@gmail.com> wrote:
>>>
>>> I think that the extra (more than 4GB) memory usage comes from the
>>> mmaped io, that is why it happens only for reads.
>>>
>>> On Tue, May 4, 2010 at 2:02 PM, Jordan Pittier <jo...@gmail.com>
>>> wrote:
>>> > I'm facing the same issue with swap. It only occurs when I perform read
>>> > operations (write are very fast :)). So I can't help you with the
>>> > memory
>>> > probleme.
>>> >
>>> > But to balance the load evenly between nodes in cluster just manually
>>> > fix
>>> > their token.(the "formula" is i * 2^127 / nb_nodes).
>>> >
>>> > Jordzn
>>> >
>>> > On Tue, May 4, 2010 at 8:20 AM, Ran Tavory <ra...@gmail.com> wrote:
>>> >>
>>> >> I'm looking into performance issues on a 0.6.1 cluster. I see two
>>> >> symptoms:
>>> >> 1. Reads and writes are slow
>>> >> 2. One of the hosts is doing a lot of GC.
>>> >> 1 is slow in the sense that in normal state the cluster used to make
>>> >> around 3-5k read and writes per second (6-10k operations per second),
>>> >> but
>>> >> how it's in the order of 200-400 ops per second, sometimes even less.
>>> >> 2 looks like this:
>>> >> $ tail -f /outbrain/cassandra/log/system.log
>>> >>  INFO [GC inspection] 2010-05-04 00:42:18,636 GCInspector.java (line
>>> >> 110)
>>> >> GC for ParNew: 672 ms, 166482384 reclaimed leaving 2872087208 used;
>>> >> max is
>>> >> 4432068608
>>> >>  INFO [GC inspection] 2010-05-04 00:42:28,638 GCInspector.java (line
>>> >> 110)
>>> >> GC for ParNew: 498 ms, 166493352 reclaimed leaving 2836049448 used;
>>> >> max is
>>> >> 4432068608
>>> >>  INFO [GC inspection] 2010-05-04 00:42:38,640 GCInspector.java (line
>>> >> 110)
>>> >> GC for ParNew: 327 ms, 166091528 reclaimed leaving 2796888424 used;
>>> >> max is
>>> >> 4432068608
>>> >> ... and it goes on and on for hours, no stopping...
>>> >> The cluster is made of 6 hosts, 3 in one DC and 3 in another.
>>> >> Each host has 8G RAM.
>>> >> -Xmx=4G
>>> >> For some reason, the load isn't distributed evenly b/w the hosts,
>>> >> although
>>> >> I'm not sure this is the cause for slowness
>>> >> $ nodetool -h localhost -p 9004 ring
>>> >> Address       Status     Load          Range
>>> >>        Ring
>>> >>
>>> >> 144413773383729447702215082383444206680
>>> >> 192.168.252.99Up         15.94 GB
>>> >> ...

Re: performance tuning - where does the slowness come from?

Posted by Ran Tavory <ra...@gmail.com>.
I canceled mmap (set DiskAccessMode to standard, as suggested further down
the thread) and indeed memory usage is sane again. So far performance hasn't
been great, but I'll wait and see.

I'm also interested in a way to cap mmap so I can take advantage of it but
not swap the host to death...
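
For reference, the knob lives in conf/storage-conf.xml in 0.6.x. A minimal
sketch of the relevant line (assuming the stock 0.6.1 config; mmap_index_only,
if your build accepts it, is the usual middle ground that maps only the index
files and reads the data files with standard IO):

  <!-- storage-conf.xml: how SSTable files are read off disk -->
  <DiskAccessMode>standard</DiskAccessMode>
  <!-- or map only the index files (assumption: value supported by this 0.6.1 build): -->
  <!-- <DiskAccessMode>mmap_index_only</DiskAccessMode> -->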

On Tue, May 4, 2010 at 9:38 PM, Kyusik Chung <ky...@discovereads.com> wrote:

> ...

Re: performance tuning - where does the slowness come from?

Posted by Kyusik Chung <ky...@discovereads.com>.
This sounds just like the slowness I was asking about in another thread - after a lot of reads, the machine uses up all available memory on the box and then starts swapping.

My understanding was that mmap helps greatly with read and write perf (until the box starts swapping, I guess)... is there any way to use mmap and cap how much memory it takes up?

What do people use in production?  mmap or no mmap?
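
Independent of the mmap-vs-standard choice, a commonly suggested companion on
a dedicated Cassandra box is to stop the kernel from swapping the JVM at all.
A sketch, assuming the node runs nothing else that needs swap:

$ cat /proc/sys/vm/swappiness      # default 60: the kernel will readily swap anon (heap) pages
$ sudo sysctl -w vm.swappiness=0   # prefer dropping page cache over swapping the heap
$ sudo swapoff -a                  # or drop swap entirely on a dedicated node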

Thanks!

Kyusik Chung

On May 4, 2010, at 10:11 AM, Schubert Zhang wrote:

> ...


Re: performance tuning - where does the slowness come from?

Posted by Schubert Zhang <zs...@gmail.com>.
1. When you initially start up your nodes, please plan the InitialToken of
each node evenly (a quick way to compute evenly spaced tokens is sketched below).
2. <DiskAccessMode>standard</DiskAccessMode>
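
A minimal sketch of that computation, using the i * 2^127 / N formula quoted
elsewhere in this thread (assuming RandomPartitioner; N=6 for this cluster):

$ python -c 'for i in range(6): print(i * 2**127 // 6)'
0
28356863910078205288614550619314017621
56713727820156410577229101238628035242
85070591730234615865843651857942052864
113427455640312821154458202477256070485
141784319550391026443072753096570088106

Each value goes into one node's InitialToken in storage-conf.xml for a fresh
ring; rebalancing a live ring by moving tokens is a separate exercise.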

On Tue, May 4, 2010 at 9:09 PM, Boris Shulman <sh...@gmail.com> wrote:

> ...

Re: performance tuning - where does the slowness come from?

Posted by Boris Shulman <sh...@gmail.com>.
I think that the extra (more than 4GB) memory usage comes from the
mmapped I/O; that is why it happens only for reads.
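
A quick way to sanity-check that (a sketch, not from this thread: substitute
the java PID from top; assumes procps pmap and the stock *-Data.db SSTable
file names):

$ pmap -x <pid> | grep Data.db | head   # RSS of individual mmapped SSTable data files
$ pmap -x <pid> | tail -n 1             # the total line; should be close to VIRT in top

If most of the resident memory beyond the 4G heap sits on Data.db mappings,
it's the mmapped reads rather than the JVM growing.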

On Tue, May 4, 2010 at 2:02 PM, Jordan Pittier <jo...@gmail.com> wrote:
> I'm facing the same issue with swap. It only occurs when I perform read
> operations (writes are very fast :)). So I can't help you with the memory
> problem.
>
> But to balance the load evenly between nodes in the cluster, just manually
> fix their tokens. (The "formula" is i * 2^127 / nb_nodes.)
>
> Jordzn
>
> On Tue, May 4, 2010 at 8:20 AM, Ran Tavory <ra...@gmail.com> wrote:
>>
>> I'm looking into performance issues on a 0.6.1 cluster. I see two
>> symptoms:
>> 1. Reads and writes are slow
>> 2. One of the hosts is doing a lot of GC.
>> 1 is slow in the sense that in normal state the cluster used to make
>> around 3-5k read and writes per second (6-10k operations per second), but
>> how it's in the order of 200-400 ops per second, sometimes even less.
>> 2 looks like this:
>> $ tail -f /outbrain/cassandra/log/system.log
>>  INFO [GC inspection] 2010-05-04 00:42:18,636 GCInspector.java (line 110)
>> GC for ParNew: 672 ms, 166482384 reclaimed leaving 2872087208 used; max is
>> 4432068608
>>  INFO [GC inspection] 2010-05-04 00:42:28,638 GCInspector.java (line 110)
>> GC for ParNew: 498 ms, 166493352 reclaimed leaving 2836049448 used; max is
>> 4432068608
>>  INFO [GC inspection] 2010-05-04 00:42:38,640 GCInspector.java (line 110)
>> GC for ParNew: 327 ms, 166091528 reclaimed leaving 2796888424 used; max is
>> 4432068608
>> ... and it goes on and on for hours, no stopping...
>> The cluster is made of 6 hosts, 3 in one DC and 3 in another.
>> Each host has 8G RAM.
>> -Xmx=4G
>> For some reason, the load isn't distributed evenly b/w the hosts, although
>> I'm not sure this is the cause for slowness
>> $ nodetool -h localhost -p 9004 ring
>> Address       Status     Load          Range
>>        Ring
>>
>> 144413773383729447702215082383444206680
>> 192.168.252.99Up         15.94 GB
>>  66002764663998929243644931915471302076     |<--|
>> 192.168.254.57Up         19.84 GB
>>  81288739225600737067856268063987022738     |   ^
>> 192.168.254.58Up         973.78 MB
>> 86999744104066390588161689990810839743     v   |
>> 192.168.252.62Up         5.18 GB
>> 88308919879653155454332084719458267849     |   ^
>> 192.168.254.59Up         10.57 GB
>>  142482163220375328195837946953175033937    v   |
>> 192.168.252.61Up         11.36 GB
>>  144413773383729447702215082383444206680    |-->|
>> The slow host is 192.168.252.61 and it isn't the most loaded one.
>> The host is waiting a lot on IO and the load average is usually 6-7
>> $ w
>>  00:42:56 up 11 days, 13:22,  1 user,  load average: 6.21, 5.52, 3.93
>> $ vmstat 5
>> procs -----------memory---------- ---swap-- -----io---- --system--
>> -----cpu------
>>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id
>> wa st
>>  0  8 2147844  45744   1816 4457384    6    5    66    32    5    2  1  1
>> 96  2  0
>>  0  8 2147164  49020   1808 4451596  385    0  2345    58 3372 9957  2  2
>> 78 18  0
>>  0  3 2146432  45704   1812 4453956  342    0  2274   108 3937 10732  2  2
>> 78 19  0
>>  0  1 2146252  44696   1804 4453436  345  164  1939   294 3647 7833  2  2
>> 78 18  0
>>  0  1 2145960  46924   1744 4451260  158    0  2423   122 4354 14597  2  2
>> 77 18  0
>>  7  1 2138344  44676    952 4504148 1722  403  1722   406 1388  439 87  0
>> 10  2  0

Re: performance tuning - where does the slowness come from?

Posted by Jordan Pittier <jo...@gmail.com>.
I'm facing the same issue with swap. It only occurs when I perform read
operations (writes are very fast :)), so I can't help you with the memory
problem.
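
If it helps to put numbers on it, here is a minimal Python sketch (assuming
a standard Linux /proc/meminfo with SwapTotal/SwapFree lines; the helper
name is just illustrative) that prints swap usage once a second, so the
growth can be lined up with the read load:

import time

def swap_used_kb():
    # Parse /proc/meminfo into {name: kB} and return SwapTotal - SwapFree.
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            name, value = line.split(":", 1)
            fields[name] = int(value.strip().split()[0])  # values are in kB
    return fields["SwapTotal"] - fields["SwapFree"]

if __name__ == "__main__":
    # Watch this while the read workload runs.
    while True:
        print("swap used: %d kB" % swap_used_kb())
        time.sleep(1)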

But to balance the load evenly between the nodes in the cluster, just assign
each node its token manually (the "formula" is i * 2^127 / nb_nodes).
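
For example, a rough Python sketch of that calculation (assuming
RandomPartitioner's 0..2^127 token space; the helper name and the 6-node
count are just illustrative):

def balanced_tokens(num_nodes):
    # Evenly spaced RandomPartitioner tokens: token_i = i * 2**127 / num_nodes.
    return [i * (2 ** 127) // num_nodes for i in range(num_nodes)]

if __name__ == "__main__":
    # 6 is only an example node count; use the size of your own ring.
    for i, token in enumerate(balanced_tokens(6)):
        print("node %d: %d" % (i, token))

Each printed value can then be set as that node's InitialToken, or applied
with nodetool move, if I remember the 0.6 knobs correctly.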

Jordzn
