Posted to user@cassandra.apache.org by Eric Pederson <er...@gmail.com> on 2017/05/22 21:02:29 UTC

Bottleneck for small inserts?

Hi all:

I'm new to Cassandra and I'm doing some performance testing.  One of the
things that I'm testing is ingestion throughput.  My server setup is:

   - 3 node cluster
   - SSD data (both commit log and sstables are on the same disk)
   - 64 GB RAM per server
   - 48 cores per server
   - Cassandra 3.0.11
   - 48 Gb heap using G1GC
   - 1 Gbps NICs

Since I'm using SSD I've tried tuning the following (one at a time), but none
seemed to make a lot of difference (see the sketch after this list):

   - concurrent_writes=384
   - memtable_flush_writers=8
   - concurrent_compactors=8
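
(For reference, these correspond to cassandra.yaml entries like the ones below.  This is only a
sketch: the config path is an assumption that varies by install, and the values are simply the
ones listed above.)

    # confirm the overrides are in place (path assumed; adjust for your install)
    grep -E '^(concurrent_writes|memtable_flush_writers|concurrent_compactors):' /etc/cassandra/cassandra.yaml
    # expected output with the settings above:
    #   concurrent_writes: 384
    #   memtable_flush_writers: 8
    #   concurrent_compactors: 8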

I am currently doing ingestion tests sending data from 3 clients on the
same subnet.  I am using cassandra-stress to do some ingestion testing.
The tests are using CL=ONE and RF=2.

Using cassandra-stress (3.10) I am able to saturate the disk using a large
enough column size and the standard five column cassandra-stress schema.
For example, -col size=fixed(400) will saturate the disk and compactions
will start falling behind.
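
A run along these lines reproduces that saturation case (a sketch only - the host list and
client thread count are placeholders, and the flags are the standard cassandra-stress 3.x
options rather than my exact command line):

    # write to the default five-column keyspace1.standard1 schema with 400-byte columns
    cassandra-stress write n=10000000 cl=ONE \
        -schema "replication(factor=2)" \
        -col "size=fixed(400)" \
        -rate threads=96 \
        -node node1,node2,node3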

One of our main tables has a row size of approximately 200 bytes, across
64 columns.  When ingesting this table I don't see any resource
saturation.  Disk utilization is around 10-15% per iostat.  Incoming
network traffic on the servers is around 100-300 Mbps.  CPU utilization is
around 20-70%.  nodetool tpstats shows mostly zeros with occasional spikes
around 500 in MutationStage.

The stress run does 10,000,000 inserts per client, each with a separate
range of partition IDs.  The run with 200 byte rows takes about 4 minutes,
with mean Latency 4.5ms, Total GC time of 21 secs, Avg GC time 173 ms.
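
The non-overlapping partition ranges come from giving each client its own -pop range, roughly
like this (a sketch; the split and node list are placeholders):

    cassandra-stress write n=10000000 cl=ONE -pop seq=1..10000000        -node node1,node2,node3   # client 1
    cassandra-stress write n=10000000 cl=ONE -pop seq=10000001..20000000 -node node1,node2,node3   # client 2
    cassandra-stress write n=10000000 cl=ONE -pop seq=20000001..30000000 -node node1,node2,node3   # client 3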

The overall performance is good - around 120k rows/sec ingested.  But I'm
curious to know where the bottleneck is.  There's no resource saturation and
nodetool tpstats shows only occasional brief queueing.  Is the rest just
expected latency inside of Cassandra?

Thanks,

-- Eric

Re: Bottleneck for small inserts?

Posted by Eric Pederson <er...@gmail.com>.
Here are a couple of iostat snapshots showing the spikes in disk queue size
(in these cases correlating with spikes in w/s and %util):

Device:         rrqm/s    wrqm/s      r/s       w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda               0.00      5.63     0.00      2.33      0.00      63.73     27.31      0.00   0.57   0.41   0.10
sdb               0.00      0.00    48.03  17990.63   3679.73  143925.07      8.18     23.39   1.30   0.01  22.57
dm-0              0.00      0.00     0.00      0.30      0.00       2.40      8.00      0.00   2.00   0.67   0.02
dm-2              0.00      0.00    48.03  17990.63   3679.73  143925.07      8.18     23.56   1.30   0.01  22.83
dm-3              0.00      0.00     0.00      7.67      0.00      61.33      8.00      0.00   0.44   0.10   0.08

Device:         rrqm/s    wrqm/s      r/s       w/s    rsec/s     wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda               0.00      2.10     0.00      1.33      0.00      27.47     20.60      0.00   0.25   0.18   0.02
sdb               0.00  16309.00   109.43   2714.23   2609.87  152186.40     54.82     11.44   4.05   0.08  23.54
dm-0              0.00      0.00     0.00      0.10      0.00       0.80      8.00      0.00   0.00   0.00   0.00
dm-2              0.00      0.00   109.43  19023.30   2609.87  152186.40      8.09    273.89  14.30   0.01  23.64
dm-3              0.00      0.00     0.00      3.33      0.00      26.67      8.00      0.00   0.25   0.07   0.02
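
A sketch of how to catch just the spike samples for the data disk, assuming the column layout
shown above (avgqu-sz is field 9 of the extended device lines) and that sdb is the data disk:

    iostat -xz 1 | awk '$1 == "sdb" && $9+0 > 100'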


-- Eric

On Wed, Jun 14, 2017 at 11:17 PM, Eric Pederson <er...@gmail.com> wrote:

> Using cassandra-stress with the out of the box schema I am seeing around
> 140k rows/second throughput using 1 client on each of 3 client machines.
> On the servers:
>
>    - CPU utilization: 43% usr/20% sys, 55%/28%, 70%/10% (the last number
>    is the older box)
>    - Inbound network traffic: 174 Mbps, 190 Mbps, 178 Mbps
>    - Disk writes/sec: ~10k each server
>    - Disk utilization is in the low single digits but spikes up to 50%
>    - Disk queue size is in the low single digits but spikes up into the
>    mid hundreds.  I even saw spikes in the thousands.   I had not noticed this
>    before.
>
> The disk stats come from iostat -xz 1.   Given the low reported
> utilization %s I would not expect to see any disk queue buildup, even low
> single digits.
>
> Going to 2 cassandra-stress clients per machine the throughput dropped to
> 133k rows/sec.
>
>    - CPU utilization: 13% usr/5% sys, 15%/25%, 40%/22% on the older box
>    - Inbound network RX: 100Mbps, 125Mbps, 120Mbps
>    - Disk utilization is a little lower, but with the same spiky behavior
>
> Going to 3 cassandra-stress clients per machine the throughput dropped to
> 110k rows/sec
>
>    - CPU utilization: 15% usr/20% sys,  15%/20%, 40%/20% on the older box
>    - Inbound network RX dropped to 130 Mbps
>    - Disk utilization stayed roughly the same
>
> I noticed that with the standard cassandra-stress schema GC is not an
> issue.   But with my application-specific schema there is a lot of GC on
> the slower box.  Also with the application-specific schema I can't seem to
> get past 36k rows/sec.   The application schema has 64 columns (mostly
> ints) and the key is (date,sequence#).   The standard stress schema has a
> lot fewer columns and no clustering column.
>
> Thanks,
>
>
>
> -- Eric
>
> On Wed, Jun 14, 2017 at 1:47 AM, Eric Pederson <er...@gmail.com> wrote:
>
>> Shoot - I didn't see that one.  I subscribe to the digest but was
>> focusing on the direct replies and accidentally missed Patrick and Jeff
>> Jirsa's messages.  Sorry about that...
>>
>> I've been using a combination of cassandra-stress, cqlsh COPY FROM and a
>> custom C++ application for my ingestion testing.   My default setting for
>> my custom client application is 96 threads, and then by default I run one
>> client application process on each of 3 machines.  I tried
>> doubling/quadrupling the number of client threads (and doubling/tripling
>> the number of client processes but keeping the threads per process the
>> same) but didn't see any change.   If I recall correctly I started getting
>> timeouts after I went much beyond concurrent_writes which is 384 (for a 48
>> CPU box) - meaning at 500 threads per client machine I started seeing
>> timeouts.    I'll try again to be sure.
>>
>> For the purposes of this conversation I will try to always use
>> cassandra-stress to keep the number of unknowns limited.  I'll run
>> more cassandra-stress clients tomorrow in line with Patrick's 3-5 per
>> server recommendation.
>>
>> Thanks!
>>
>>
>> -- Eric
>>
>> On Wed, Jun 14, 2017 at 12:40 AM, Jonathan Haddad <jo...@jonhaddad.com>
>> wrote:
>>
>>> Did you try adding more client stress nodes as Patrick recommended?
>>>
>>> On Tue, Jun 13, 2017 at 9:31 PM Eric Pederson <er...@gmail.com> wrote:
>>>
>>>> Scratch that theory - the flamegraphs show that GC is only 3-4% of two
>>>> newer machine's overall processing, compared to 18% on the slow machine.
>>>>
>>>> I took that machine out of the cluster completely and recreated the
>>>> keyspaces.  The ingest tests now run slightly faster (!).   I would have
>>>> expected a linear slowdown since the load is fairly balanced across
>>>> partitions.  GC appears to be the bottleneck in the 3-server
>>>> configuration.  But still in the two-server configuration the
>>>> CPU/disk/network is still not being fully utilized (the closest is CPU at
>>>> ~45% on one ingest test).  nodetool tpstats shows only blips of
>>>> queueing.
>>>>
>>>>
>>>>
>>>>
>>>> -- Eric
>>>>
>>>> On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson <er...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all - I wanted to follow up on this.  I'm happy with the throughput
>>>>> we're getting but I'm still curious about the bottleneck.
>>>>>
>>>>> The big thing that sticks out is one of the nodes is logging frequent
>>>>> GCInspector messages: 350-500ms every 3-6 seconds.  All three nodes
>>>>> in the cluster have identical Cassandra configuration, but the node that is
>>>>> logging frequent GCs is an older machine with slower CPU and SSD.  This
>>>>> node logs frequent GCInspectors both under load and when compacting
>>>>> but otherwise unloaded.
>>>>>
>>>>> My theory is that the other two nodes have similar GC frequency
>>>>> (because they are seeing the same basic load), but because they are faster
>>>>> machines, they don't spend as much time per GC and don't cross the
>>>>> GCInspector threshold.  Does that sound plausible?   nodetool tpstats
>>>>> doesn't show any queueing in the system.
>>>>>
>>>>> Here's flamegraphs from the system when running a cqlsh COPY FROM:
>>>>>
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>    /flamegraph_ultva01_cars_batch2.svg
>>>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg>
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>    /flamegraph_ultva02_cars_batch2.svg
>>>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg>
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>    /flamegraph_ultva03_cars_batch2.svg
>>>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg>
>>>>>
>>>>> The slow node (ultva03) spends disproportional time in GC.
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> -- Eric
>>>>>
>>>>> On Thu, May 25, 2017 at 8:09 PM, Eric Pederson <er...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Due to a cut and paste error those flamegraphs were a recording of
>>>>>> the whole system, not just Cassandra.    Throughput is approximately 30k
>>>>>> rows/sec.
>>>>>>
>>>>>> Here's the graphs with just the Cassandra PID:
>>>>>>
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>>    /flamegraph_ultva01_sars2.svg
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>>    /flamegraph_ultva02_sars2.svg
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>>    /flamegraph_ultva03_sars2.svg
>>>>>>
>>>>>>
>>>>>> And here's graphs during a cqlsh COPY FROM to the same table, using
>>>>>> real data, MAXBATCHSIZE=2.    Throughput is good at approximately
>>>>>> 110k rows/sec.
>>>>>>
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>>    /flamegraph_ultva01_cars_batch2.svg
>>>>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg>
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>>    /flamegraph_ultva02_cars_batch2.svg
>>>>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg>
>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>>    /flamegraph_ultva03_cars_batch2.svg
>>>>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- Eric
>>>>>>
>>>>>> On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <er...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Totally understood :)
>>>>>>>
>>>>>>> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to
>>>>>>> include all of the CPUs.  Actually most of them were set that way already
>>>>>>> (for example, 0000ffff,ffffffff) - it might be because irqbalanced
>>>>>>> is running.  But for some reason the interrupts are all being handled on
>>>>>>> CPU 0 anyway.
>>>>>>>
>>>>>>> I see this in /var/log/dmesg on the machines:
>>>>>>>
>>>>>>>>
>>>>>>>> Your BIOS has requested that x2apic be disabled.
>>>>>>>> This will leave your machine vulnerable to irq-injection attacks.
>>>>>>>> Use 'intremap=no_x2apic_optout' to override BIOS request.
>>>>>>>> Enabled IRQ remapping in xapic mode
>>>>>>>> x2apic not enabled, IRQ remapping is in xapic mode
>>>>>>>
>>>>>>>
>>>>>>> In a reply to one of the comments, he says:
>>>>>>>
>>>>>>>
>>>>>>> When IO-APIC configured to spread interrupts among all cores, it can
>>>>>>>> handle up to eight cores. If you have more than eight cores, kernel will
>>>>>>>> not configure IO-APIC to spread interrupts. Thus the trick I described in
>>>>>>>> the article will not work.
>>>>>>>> Otherwise it may be caused by buggy BIOS or even buggy hardware.
>>>>>>>
>>>>>>>
>>>>>>> I'm not sure if either of them is relevant to my situation.
>>>>>>>
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -- Eric
>>>>>>>
>>>>>>> On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <jo...@jonhaddad.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> You shouldn't need a kernel recompile.  Check out the section
>>>>>>>> "Simple solution for the problem" in http://www.alexonlinux.com/
>>>>>>>> smp-affinity-and-proper-interrupt-handling-in-linux.  You can
>>>>>>>> balance your requests across up to 8 CPUs.
>>>>>>>>
>>>>>>>> I'll check out the flame graphs in a little bit - in the middle of
>>>>>>>> something and my brain doesn't multitask well :)
>>>>>>>>
>>>>>>>> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <er...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Jonathan -
>>>>>>>>>
>>>>>>>>> It looks like these machines are configured to use CPU 0 for all
>>>>>>>>> I/O interrupts.  I don't think I'm going to get the OK to compile a new
>>>>>>>>> kernel for them to balance the interrupts across CPUs, but to mitigate the
>>>>>>>>> problem I taskset the Cassandra process to run on all CPU except 0.  It
>>>>>>>>> didn't change the performance though.  Let me know if you think it's
>>>>>>>>> crucial that we balance the interrupts across CPUs and I can try to lobby
>>>>>>>>> for a new kernel.
>>>>>>>>>
>>>>>>>>> Here are flamegraphs from each node from a cassandra-stress
>>>>>>>>> ingest into a table representative of what we are going to be using.
>>>>>>>>> This table is also roughly 200 bytes, with 64 columns and the primary key (date,
>>>>>>>>> sequence_number).  Cassandra-stress was run on 3 separate client
>>>>>>>>> machines.  Using cassandra-stress to write to this table I see
>>>>>>>>> the same thing: neither disk, CPU or network is fully utilized.
>>>>>>>>>
>>>>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>>>>>    /flamegraph_ultva01_sars.svg
>>>>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>>>>>    /flamegraph_ultva02_sars.svg
>>>>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>>>>>    /flamegraph_ultva03_sars.svg
>>>>>>>>>
>>>>>>>>> Re: GC: In the stress run with the parameters above, two of the
>>>>>>>>> three nodes log zero or one GCInspectors.  On the other hand, the
>>>>>>>>> 3rd machine logs a GCInspector every 5 seconds or so, 300-500ms
>>>>>>>>> each time.  I found out that the 3rd machine actually has different specs
>>>>>>>>> from the other two.  It's an older box with the same RAM but fewer CPUs (32
>>>>>>>>> instead of 48), a slower SSD and slower memory.   The Cassandra
>>>>>>>>> configuration is exactly the same.   I tried running Cassandra with only 32
>>>>>>>>> CPUs on the newer boxes to see if that would cause them to GC pause more,
>>>>>>>>> but it didn't.
>>>>>>>>>
>>>>>>>>> On a separate topic - for this cassandra-stress run I reduced the
>>>>>>>>> batch size to 2 in order to keep the logs clean.  That also reduced the
>>>>>>>>> throughput from around 100k rows/second to 32k rows/sec.  I've been doing
>>>>>>>>> ingestion tests using cassandra-stress, cqlsh COPY FROM and a
>>>>>>>>> custom C++ application.  In most of the tests that I've been doing I've
>>>>>>>>> been using a batch size of around 20 (unlogged, all batch rows have the
>>>>>>>>> same partition key).  However, it fills the logs with batch size warnings.
>>>>>>>>> I was going to raise the batch warning size but the docs scared me away
>>>>>>>>> from doing that.   Given that we're using unlogged/same partition batches
>>>>>>>>> is it safe to raise the batch size warning limit?   Actually cqlsh
>>>>>>>>> COPY FROM has very good throughput using a small batch size, but
>>>>>>>>> I can't get that same throughput in cassandra-stress or my C++ app with a
>>>>>>>>> batch size of 2.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -- Eric
>>>>>>>>>
>>>>>>>>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <
>>>>>>>>> jon@jonhaddad.com> wrote:
>>>>>>>>>
>>>>>>>>>> How many CPUs are you using for interrupts?
>>>>>>>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrup
>>>>>>>>>> t-handling-in-linux
>>>>>>>>>>
>>>>>>>>>> Have you tried making a flame graph to see where Cassandra is
>>>>>>>>>> spending its time? http://www.brendangregg.
>>>>>>>>>> com/blog/2014-06-12/java-flame-graphs.html
>>>>>>>>>>
>>>>>>>>>> Are you tracking GC pauses?
>>>>>>>>>>
>>>>>>>>>> Jon
>>>>>>>>>>
>>>>>>>>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <er...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all:
>>>>>>>>>>>
>>>>>>>>>>> I'm new to Cassandra and I'm doing some performance testing.
>>>>>>>>>>> One of the things that I'm testing is ingestion throughput.   My server setup
>>>>>>>>>>> is:
>>>>>>>>>>>
>>>>>>>>>>>    - 3 node cluster
>>>>>>>>>>>    - SSD data (both commit log and sstables are on the same
>>>>>>>>>>>    disk)
>>>>>>>>>>>    - 64 GB RAM per server
>>>>>>>>>>>    - 48 cores per server
>>>>>>>>>>>    - Cassandra 3.0.11
>>>>>>>>>>>    - 48 Gb heap using G1GC
>>>>>>>>>>>    - 1 Gbps NICs
>>>>>>>>>>>
>>>>>>>>>>> Since I'm using SSD I've tried tuning the following (one at a
>>>>>>>>>>> time) but none seemed to make a lot of difference:
>>>>>>>>>>>
>>>>>>>>>>>    - concurrent_writes=384
>>>>>>>>>>>    - memtable_flush_writers=8
>>>>>>>>>>>    - concurrent_compactors=8
>>>>>>>>>>>
>>>>>>>>>>> I am currently doing ingestion tests sending data from 3 clients
>>>>>>>>>>> on the same subnet.  I am using cassandra-stress to do some ingestion
>>>>>>>>>>> testing.  The tests are using CL=ONE and RF=2.
>>>>>>>>>>>
>>>>>>>>>>> Using cassandra-stress (3.10) I am able to saturate the disk
>>>>>>>>>>> using a large enough column size and the standard five column
>>>>>>>>>>> cassandra-stress schema.  For example, -col size=fixed(400)
>>>>>>>>>>> will saturate the disk and compactions will start falling behind.
>>>>>>>>>>>
>>>>>>>>>>> One of our main tables has a row size of approximately 200
>>>>>>>>>>> bytes, across 64 columns.  When ingesting this table I don't see any
>>>>>>>>>>> resource saturation.  Disk utilization is around 10-15% per
>>>>>>>>>>> iostat.  Incoming network traffic on the servers is around
>>>>>>>>>>> 100-300 Mbps.  CPU utilization is around 20-70%.  nodetool
>>>>>>>>>>> tpstats shows mostly zeros with occasional spikes around 500 in
>>>>>>>>>>> MutationStage.
>>>>>>>>>>>
>>>>>>>>>>> The stress run does 10,000,000 inserts per client, each with a
>>>>>>>>>>> separate range of partition IDs.  The run with 200 byte rows takes about 4
>>>>>>>>>>> minutes, with mean Latency 4.5ms, Total GC time of 21 secs, Avg GC time 173
>>>>>>>>>>> ms.
>>>>>>>>>>>
>>>>>>>>>>> The overall performance is good - around 120k rows/sec
>>>>>>>>>>> ingested.  But I'm curious to know where the bottleneck is.  There's no
>>>>>>>>>>> resource saturation and nodetool tpstats shows only occasional
>>>>>>>>>>> brief queueing.  Is the rest just expected latency inside of Cassandra?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> -- Eric
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>
>

Re: Bottleneck for small inserts?

Posted by Eric Pederson <er...@gmail.com>.
Using cassandra-stress with the out of the box schema I am seeing around
140k rows/second throughput using 1 client on each of 3 client machines.
On the servers:

   - CPU utilization: 43% usr/20% sys, 55%/28%, 70%/10% (the last number is
   the older box)
   - Inbound network traffic: 174 Mbps, 190 Mbps, 178 Mbps
   - Disk writes/sec: ~10k each server
   - Disk utilization is in the low single digits but spikes up to 50%
   - Disk queue size is in the low single digits but spikes up into the mid
   hundreds.  I even saw spikes in the thousands.   I had not noticed this before.

The disk stats come from iostat -xz 1.   Given the low reported utilization
%s I would not expect to see any disk queue buildup, even low single digits.

Going to 2 cassandra-stress clients per machine the throughput dropped to
133k rows/sec.

   - CPU utilization: 13% usr/5% sys, 15%/25%, 40%/22% on the older box
   - Inbound network RX: 100Mbps, 125Mbps, 120Mbps
   - Disk utilization is a little lower, but with the same spiky behavior

Going to 3 cassandra-stress clients per machine the throughput dropped to
110k rows/sec

   - CPU utilization: 15% usr/20% sys,  15%/20%, 40%/20% on the older box
   - Inbound network RX dropped to 130 Mbps
   - Disk utilization stayed roughly the same

I noticed that with the standard cassandra-stress schema GC is not an
issue.   But with my application-specific schema there is a lot of GC on
the slower box.  Also with the application-specific schema I can't seem to
get past 36k rows/sec.   The application schema has 64 columns (mostly
ints) and the key is (date,sequence#).   The standard stress schema has a
lot fewer columns and no clustering column.
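
For context, the application table is shaped roughly like this.  This is a hypothetical
sketch - the keyspace, table and column names are placeholders, not the real schema:

    cqlsh -e "
      CREATE TABLE IF NOT EXISTS app_ks.ingest_test (
        event_date date,
        seq        bigint,
        c01 int, c02 int, c03 int,   -- ... and ~60 more mostly-int columns
        PRIMARY KEY ((event_date), seq)
      );"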

Thanks,



-- Eric

On Wed, Jun 14, 2017 at 1:47 AM, Eric Pederson <er...@gmail.com> wrote:

> Shoot - I didn't see that one.  I subscribe to the digest but was focusing
> on the direct replies and accidentally missed Patrick and Jeff Jirsa's
> messages.  Sorry about that...
>
> I've been using a combination of cassandra-stress, cqlsh COPY FROM and a
> custom C++ application for my ingestion testing.   My default setting for
> my custom client application is 96 threads, and then by default I run one
> client application process on each of 3 machines.  I tried
> doubling/quadrupling the number of client threads (and doubling/tripling
> the number of client processes but keeping the threads per process the
> same) but didn't see any change.   If I recall correctly I started getting
> timeouts after I went much beyond concurrent_writes which is 384 (for a 48
> CPU box) - meaning at 500 threads per client machine I started seeing
> timeouts.    I'll try again to be sure.
>
> For the purposes of this conversation I will try to always use
> cassandra-stress to keep the number of unknowns limited.  I'll run
> more cassandra-stress clients tomorrow in line with Patrick's 3-5 per
> server recommendation.
>
> Thanks!
>
>
> -- Eric
>
> On Wed, Jun 14, 2017 at 12:40 AM, Jonathan Haddad <jo...@jonhaddad.com>
> wrote:
>
>> Did you try adding more client stress nodes as Patrick recommended?
>>
>> On Tue, Jun 13, 2017 at 9:31 PM Eric Pederson <er...@gmail.com> wrote:
>>
>>> Scratch that theory - the flamegraphs show that GC is only 3-4% of two
>>> newer machine's overall processing, compared to 18% on the slow machine.
>>>
>>> I took that machine out of the cluster completely and recreated the
>>> keyspaces.  The ingest tests now run slightly faster (!).   I would have
>>> expected a linear slowdown since the load is fairly balanced across
>>> partitions.  GC appears to be the bottleneck in the 3-server
>>> configuration.  But still in the two-server configuration the
>>> CPU/disk/network is still not being fully utilized (the closest is CPU at
>>> ~45% on one ingest test).  nodetool tpstats shows only blips of
>>> queueing.
>>>
>>>
>>>
>>>
>>> -- Eric
>>>
>>> On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson <er...@gmail.com>
>>> wrote:
>>>
>>>> Hi all - I wanted to follow up on this.  I'm happy with the throughput
>>>> we're getting but I'm still curious about the bottleneck.
>>>>
>>>> The big thing that sticks out is one of the nodes is logging frequent
>>>> GCInspector messages: 350-500ms every 3-6 seconds.  All three nodes in
>>>> the cluster have identical Cassandra configuration, but the node that is
>>>> logging frequent GCs is an older machine with slower CPU and SSD.  This
>>>> node logs frequent GCInspectors both under load and when compacting
>>>> but otherwise unloaded.
>>>>
>>>> My theory is that the other two nodes have similar GC frequency
>>>> (because they are seeing the same basic load), but because they are faster
>>>> machines, they don't spend as much time per GC and don't cross the
>>>> GCInspector threshold.  Does that sound plausible?   nodetool tpstats
>>>> doesn't show any queueing in the system.
>>>>
>>>> Here's flamegraphs from the system when running a cqlsh COPY FROM:
>>>>
>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>    /flamegraph_ultva01_cars_batch2.svg
>>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg>
>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>    /flamegraph_ultva02_cars_batch2.svg
>>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg>
>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>    /flamegraph_ultva03_cars_batch2.svg
>>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg>
>>>>
>>>> The slow node (ultva03) spends disproportional time in GC.
>>>>
>>>> Thanks,
>>>>
>>>>
>>>> -- Eric
>>>>
>>>> On Thu, May 25, 2017 at 8:09 PM, Eric Pederson <er...@gmail.com>
>>>> wrote:
>>>>
>>>>> Due to a cut and paste error those flamegraphs were a recording of the
>>>>> whole system, not just Cassandra.    Throughput is approximately 30k
>>>>> rows/sec.
>>>>>
>>>>> Here's the graphs with just the Cassandra PID:
>>>>>
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>    /flamegraph_ultva01_sars2.svg
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>    /flamegraph_ultva02_sars2.svg
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>    /flamegraph_ultva03_sars2.svg
>>>>>
>>>>>
>>>>> And here's graphs during a cqlsh COPY FROM to the same table, using
>>>>> real data, MAXBATCHSIZE=2.    Throughput is good at approximately
>>>>> 110k rows/sec.
>>>>>
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>    /flamegraph_ultva01_cars_batch2.svg
>>>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg>
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>    /flamegraph_ultva02_cars_batch2.svg
>>>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg>
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>    /flamegraph_ultva03_cars_batch2.svg
>>>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -- Eric
>>>>>
>>>>> On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <er...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Totally understood :)
>>>>>>
>>>>>> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to
>>>>>> include all of the CPUs.  Actually most of them were set that way already
>>>>>> (for example, 0000ffff,ffffffff) - it might be because irqbalanced
>>>>>> is running.  But for some reason the interrupts are all being handled on
>>>>>> CPU 0 anyway.
>>>>>>
>>>>>> I see this in /var/log/dmesg on the machines:
>>>>>>
>>>>>>>
>>>>>>> Your BIOS has requested that x2apic be disabled.
>>>>>>> This will leave your machine vulnerable to irq-injection attacks.
>>>>>>> Use 'intremap=no_x2apic_optout' to override BIOS request.
>>>>>>> Enabled IRQ remapping in xapic mode
>>>>>>> x2apic not enabled, IRQ remapping is in xapic mode
>>>>>>
>>>>>>
>>>>>> In a reply to one of the comments, he says:
>>>>>>
>>>>>>
>>>>>> When IO-APIC configured to spread interrupts among all cores, it can
>>>>>>> handle up to eight cores. If you have more than eight cores, kernel will
>>>>>>> not configure IO-APIC to spread interrupts. Thus the trick I described in
>>>>>>> the article will not work.
>>>>>>> Otherwise it may be caused by buggy BIOS or even buggy hardware.
>>>>>>
>>>>>>
>>>>>> I'm not sure if either of them is relevant to my situation.
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- Eric
>>>>>>
>>>>>> On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <jo...@jonhaddad.com>
>>>>>> wrote:
>>>>>>
>>>>>>> You shouldn't need a kernel recompile.  Check out the section
>>>>>>> "Simple solution for the problem" in http://www.alexonlinux.com/
>>>>>>> smp-affinity-and-proper-interrupt-handling-in-linux.  You can
>>>>>>> balance your requests across up to 8 CPUs.
>>>>>>>
>>>>>>> I'll check out the flame graphs in a little bit - in the middle of
>>>>>>> something and my brain doesn't multitask well :)
>>>>>>>
>>>>>>> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <er...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Jonathan -
>>>>>>>>
>>>>>>>> It looks like these machines are configured to use CPU 0 for all
>>>>>>>> I/O interrupts.  I don't think I'm going to get the OK to compile a new
>>>>>>>> kernel for them to balance the interrupts across CPUs, but to mitigate the
>>>>>>>> problem I taskset the Cassandra process to run on all CPU except 0.  It
>>>>>>>> didn't change the performance though.  Let me know if you think it's
>>>>>>>> crucial that we balance the interrupts across CPUs and I can try to lobby
>>>>>>>> for a new kernel.
>>>>>>>>
>>>>>>>> Here are flamegraphs from each node from a cassandra-stress ingest
>>>>>>>> into a table representative of what we are going to be using.   This
>>>>>>>> table is also roughly 200 bytes, with 64 columns and the primary key (date,
>>>>>>>> sequence_number).  Cassandra-stress was run on 3 separate client
>>>>>>>> machines.  Using cassandra-stress to write to this table I see the
>>>>>>>> same thing: neither disk, CPU or network is fully utilized.
>>>>>>>>
>>>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>>>>    /flamegraph_ultva01_sars.svg
>>>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>>>>    /flamegraph_ultva02_sars.svg
>>>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>>>>    /flamegraph_ultva03_sars.svg
>>>>>>>>
>>>>>>>> Re: GC: In the stress run with the parameters above, two of the
>>>>>>>> three nodes log zero or one GCInspectors.  On the other hand, the
>>>>>>>> 3rd machine logs a GCInspector every 5 seconds or so, 300-500ms
>>>>>>>> each time.  I found out that the 3rd machine actually has different specs
>>>>>>>> from the other two.  It's an older box with the same RAM but fewer CPUs (32
>>>>>>>> instead of 48), a slower SSD and slower memory.   The Cassandra
>>>>>>>> configuration is exactly the same.   I tried running Cassandra with only 32
>>>>>>>> CPUs on the newer boxes to see if that would cause them to GC pause more,
>>>>>>>> but it didn't.
>>>>>>>>
>>>>>>>> On a separate topic - for this cassandra-stress run I reduced the
>>>>>>>> batch size to 2 in order to keep the logs clean.  That also reduced the
>>>>>>>> throughput from around 100k rows/second to 32k rows/sec.  I've been doing
>>>>>>>> ingestion tests using cassandra-stress, cqlsh COPY FROM and a
>>>>>>>> custom C++ application.  In most of the tests that I've been doing I've
>>>>>>>> been using a batch size of around 20 (unlogged, all batch rows have the
>>>>>>>> same partition key).  However, it fills the logs with batch size warnings.
>>>>>>>> I was going to raise the batch warning size but the docs scared me away
>>>>>>>> from doing that.   Given that we're using unlogged/same partition batches
>>>>>>>> is it safe to raise the batch size warning limit?   Actually cqlsh
>>>>>>>> COPY FROM has very good throughput using a small batch size, but I
>>>>>>>> can't get that same throughput in cassandra-stress or my C++ app with a
>>>>>>>> batch size of 2.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -- Eric
>>>>>>>>
>>>>>>>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <jon@jonhaddad.com
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> How many CPUs are you using for interrupts?
>>>>>>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrup
>>>>>>>>> t-handling-in-linux
>>>>>>>>>
>>>>>>>>> Have you tried making a flame graph to see where Cassandra is
>>>>>>>>> spending its time? http://www.brendangregg.
>>>>>>>>> com/blog/2014-06-12/java-flame-graphs.html
>>>>>>>>>
>>>>>>>>> Are you tracking GC pauses?
>>>>>>>>>
>>>>>>>>> Jon
>>>>>>>>>
>>>>>>>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <er...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all:
>>>>>>>>>>
>>>>>>>>>> I'm new to Cassandra and I'm doing some performance testing.  One
>>>>>>>>>> of the things that I'm testing is ingestion throughput.   My server setup is:
>>>>>>>>>>
>>>>>>>>>>    - 3 node cluster
>>>>>>>>>>    - SSD data (both commit log and sstables are on the same disk)
>>>>>>>>>>    - 64 GB RAM per server
>>>>>>>>>>    - 48 cores per server
>>>>>>>>>>    - Cassandra 3.0.11
>>>>>>>>>>    - 48 Gb heap using G1GC
>>>>>>>>>>    - 1 Gbps NICs
>>>>>>>>>>
>>>>>>>>>> Since I'm using SSD I've tried tuning the following (one at a
>>>>>>>>>> time) but none seemed to make a lot of difference:
>>>>>>>>>>
>>>>>>>>>>    - concurrent_writes=384
>>>>>>>>>>    - memtable_flush_writers=8
>>>>>>>>>>    - concurrent_compactors=8
>>>>>>>>>>
>>>>>>>>>> I am currently doing ingestion tests sending data from 3 clients
>>>>>>>>>> on the same subnet.  I am using cassandra-stress to do some ingestion
>>>>>>>>>> testing.  The tests are using CL=ONE and RF=2.
>>>>>>>>>>
>>>>>>>>>> Using cassandra-stress (3.10) I am able to saturate the disk
>>>>>>>>>> using a large enough column size and the standard five column
>>>>>>>>>> cassandra-stress schema.  For example, -col size=fixed(400) will
>>>>>>>>>> saturate the disk and compactions will start falling behind.
>>>>>>>>>>
>>>>>>>>>> One of our main tables has a row size of approximately 200
>>>>>>>>>> bytes, across 64 columns.  When ingesting this table I don't see any
>>>>>>>>>> resource saturation.  Disk utilization is around 10-15% per
>>>>>>>>>> iostat.  Incoming network traffic on the servers is around
>>>>>>>>>> 100-300 Mbps.  CPU utilization is around 20-70%.  nodetool
>>>>>>>>>> tpstats shows mostly zeros with occasional spikes around 500 in
>>>>>>>>>> MutationStage.
>>>>>>>>>>
>>>>>>>>>> The stress run does 10,000,000 inserts per client, each with a
>>>>>>>>>> separate range of partition IDs.  The run with 200 byte rows takes about 4
>>>>>>>>>> minutes, with mean Latency 4.5ms, Total GC time of 21 secs, Avg GC time 173
>>>>>>>>>> ms.
>>>>>>>>>>
>>>>>>>>>> The overall performance is good - around 120k rows/sec ingested.
>>>>>>>>>> But I'm curious to know where the bottleneck is.  There's no resource
>>>>>>>>>> saturation and nodetool tpstats shows only occasional brief
>>>>>>>>>> queueing.  Is the rest just expected latency inside of Cassandra?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> -- Eric
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>

Re: Bottleneck for small inserts?

Posted by Eric Pederson <er...@gmail.com>.
Shoot - I didn't see that one.  I subscribe to the digest but was focusing
on the direct replies and accidentally missed Patrick and Jeff Jirsa's
messages.  Sorry about that...

I've been using a combination of cassandra-stress, cqlsh COPY FROM and a
custom C++ application for my ingestion testing.   My default setting for
my custom client application is 96 threads, and then by default I run one
client application process on each of 3 machines.  I tried
doubling/quadrupling the number of client threads (and doubling/tripling
the number of client processes but keeping the threads per process the
same) but didn't see any change.   If I recall correctly I started getting
timeouts after I went much beyond concurrent_writes which is 384 (for a 48
CPU box) - meaning at 500 threads per client machine I started seeing
timeouts.    I'll try again to be sure.
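
Something like the following per client machine would reproduce that sweep (a sketch; the
thread counts and node list are placeholders):

    # step client-side concurrency up to and past concurrent_writes (384) to find the timeout knee
    for t in 96 192 384 500; do
        cassandra-stress write n=10000000 cl=ONE -rate threads=$t -node node1,node2,node3
    done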

For the purposes of this conversation I will try to always use
cassandra-stress to keep the number of unknowns limited.  I'll run
more cassandra-stress clients tomorrow in line with Patrick's 3-5 per
server recommendation.

Thanks!


-- Eric

On Wed, Jun 14, 2017 at 12:40 AM, Jonathan Haddad <jo...@jonhaddad.com> wrote:

> Did you try adding more client stress nodes as Patrick recommended?
>
> On Tue, Jun 13, 2017 at 9:31 PM Eric Pederson <er...@gmail.com> wrote:
>
>> Scratch that theory - the flamegraphs show that GC is only 3-4% of two
>> newer machine's overall processing, compared to 18% on the slow machine.
>>
>> I took that machine out of the cluster completely and recreated the
>> keyspaces.  The ingest tests now run slightly faster (!).   I would have
>> expected a linear slowdown since the load is fairly balanced across
>> partitions.  GC appears to be the bottleneck in the 3-server
>> configuration.  But still in the two-server configuration the
>> CPU/disk/network is still not being fully utilized (the closest is CPU at
>> ~45% on one ingest test).  nodetool tpstats shows only blips of
>> queueing.
>>
>>
>>
>>
>> -- Eric
>>
>> On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson <er...@gmail.com> wrote:
>>
>>> Hi all - I wanted to follow up on this.  I'm happy with the throughput
>>> we're getting but I'm still curious about the bottleneck.
>>>
>>> The big thing that sticks out is one of the nodes is logging frequent
>>> GCInspector messages: 350-500ms every 3-6 seconds.  All three nodes in
>>> the cluster have identical Cassandra configuration, but the node that is
>>> logging frequent GCs is an older machine with slower CPU and SSD.  This
>>> node logs frequent GCInspectors both under load and when compacting but
>>> otherwise unloaded.
>>>
>>> My theory is that the other two nodes have similar GC frequency (because
>>> they are seeing the same basic load), but because they are faster machines,
>>> they don't spend as much time per GC and don't cross the GCInspector
>>> threshold.  Does that sound plausible?   nodetool tpstats doesn't show
>>> any queueing in the system.
>>>
>>> Here's flamegraphs from the system when running a cqlsh COPY FROM:
>>>
>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>    /flamegraph_ultva01_cars_batch2.svg
>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg>
>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>    /flamegraph_ultva02_cars_batch2.svg
>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg>
>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>    /flamegraph_ultva03_cars_batch2.svg
>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg>
>>>
>>> The slow node (ultva03) spends disproportional time in GC.
>>>
>>> Thanks,
>>>
>>>
>>> -- Eric
>>>
>>> On Thu, May 25, 2017 at 8:09 PM, Eric Pederson <er...@gmail.com>
>>> wrote:
>>>
>>>> Due to a cut and paste error those flamegraphs were a recording of the
>>>> whole system, not just Cassandra.    Throughput is approximately 30k
>>>> rows/sec.
>>>>
>>>> Here's the graphs with just the Cassandra PID:
>>>>
>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>    /flamegraph_ultva01_sars2.svg
>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>    /flamegraph_ultva02_sars2.svg
>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>    /flamegraph_ultva03_sars2.svg
>>>>
>>>>
>>>> And here's graphs during a cqlsh COPY FROM to the same table, using
>>>> real data, MAXBATCHSIZE=2.    Throughput is good at approximately 110k
>>>> rows/sec.
>>>>
>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>    /flamegraph_ultva01_cars_batch2.svg
>>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg>
>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>    /flamegraph_ultva02_cars_batch2.svg
>>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg>
>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>    /flamegraph_ultva03_cars_batch2.svg
>>>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg>
>>>>
>>>>
>>>>
>>>>
>>>> -- Eric
>>>>
>>>> On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <er...@gmail.com>
>>>> wrote:
>>>>
>>>>> Totally understood :)
>>>>>
>>>>> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to
>>>>> include all of the CPUs.  Actually most of them were set that way already
>>>>> (for example, 0000ffff,ffffffff) - it might be because irqbalanced is
>>>>> running.  But for some reason the interrupts are all being handled on CPU 0
>>>>> anyway.
>>>>>
>>>>> I see this in /var/log/dmesg on the machines:
>>>>>
>>>>>>
>>>>>> Your BIOS has requested that x2apic be disabled.
>>>>>> This will leave your machine vulnerable to irq-injection attacks.
>>>>>> Use 'intremap=no_x2apic_optout' to override BIOS request.
>>>>>> Enabled IRQ remapping in xapic mode
>>>>>> x2apic not enabled, IRQ remapping is in xapic mode
>>>>>
>>>>>
>>>>> In a reply to one of the comments, he says:
>>>>>
>>>>>
>>>>> When IO-APIC configured to spread interrupts among all cores, it can
>>>>>> handle up to eight cores. If you have more than eight cores, kernel will
>>>>>> not configure IO-APIC to spread interrupts. Thus the trick I described in
>>>>>> the article will not work.
>>>>>> Otherwise it may be caused by buggy BIOS or even buggy hardware.
>>>>>
>>>>>
>>>>> I'm not sure if either of them is relevant to my situation.
>>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -- Eric
>>>>>
>>>>> On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <jo...@jonhaddad.com>
>>>>> wrote:
>>>>>
>>>>>> You shouldn't need a kernel recompile.  Check out the section "Simple
>>>>>> solution for the problem" in http://www.alexonlinux.com/
>>>>>> smp-affinity-and-proper-interrupt-handling-in-linux.  You can
>>>>>> balance your requests across up to 8 CPUs.
>>>>>>
>>>>>> I'll check out the flame graphs in a little bit - in the middle of
>>>>>> something and my brain doesn't multitask well :)
>>>>>>
>>>>>> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <er...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Jonathan -
>>>>>>>
>>>>>>> It looks like these machines are configured to use CPU 0 for all I/O
>>>>>>> interrupts.  I don't think I'm going to get the OK to compile a new kernel
>>>>>>> for them to balance the interrupts across CPUs, but to mitigate the problem
>>>>>>> I taskset the Cassandra process to run on all CPU except 0.  It didn't
>>>>>>> change the performance though.  Let me know if you think it's crucial that
>>>>>>> we balance the interrupts across CPUs and I can try to lobby for a new
>>>>>>> kernel.
>>>>>>>
>>>>>>> Here are flamegraphs from each node from a cassandra-stress ingest
>>>>>>> into a table representative of what we are going to be using.   This
>>>>>>> table is also roughly 200 bytes, with 64 columns and the primary key (date,
>>>>>>> sequence_number).  Cassandra-stress was run on 3 separate client
>>>>>>> machines.  Using cassandra-stress to write to this table I see the
>>>>>>> same thing: neither disk, CPU or network is fully utilized.
>>>>>>>
>>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>>>    /flamegraph_ultva01_sars.svg
>>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>>>    /flamegraph_ultva02_sars.svg
>>>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>>>    /flamegraph_ultva03_sars.svg
>>>>>>>
>>>>>>> Re: GC: In the stress run with the parameters above, two of the
>>>>>>> three nodes log zero or one GCInspectors.  On the other hand, the
>>>>>>> 3rd machine logs a GCInspector every 5 seconds or so, 300-500ms
>>>>>>> each time.  I found out that the 3rd machine actually has different specs
>>>>>>> from the other two.  It's an older box with the same RAM but fewer CPUs (32
>>>>>>> instead of 48), a slower SSD and slower memory.   The Cassandra
>>>>>>> configuration is exactly the same.   I tried running Cassandra with only 32
>>>>>>> CPUs on the newer boxes to see if that would cause them to GC pause more,
>>>>>>> but it didn't.
>>>>>>>
>>>>>>> On a separate topic - for this cassandra-stress run I reduced the
>>>>>>> batch size to 2 in order to keep the logs clean.  That also reduced the
>>>>>>> throughput from around 100k rows/second to 32k rows/sec.  I've been doing
>>>>>>> ingestion tests using cassandra-stress, cqlsh COPY FROM and a
>>>>>>> custom C++ application.  In most of the tests that I've been doing I've
>>>>>>> been using a batch size of around 20 (unlogged, all batch rows have the
>>>>>>> same partition key).  However, it fills the logs with batch size warnings.
>>>>>>> I was going to raise the batch warning size but the docs scared me away
>>>>>>> from doing that.   Given that we're using unlogged/same partition batches
>>>>>>> is it safe to raise the batch size warning limit?   Actually cqlsh
>>>>>>> COPY FROM has very good throughput using a small batch size, but I
>>>>>>> can't get that same throughput in cassandra-stress or my C++ app with a
>>>>>>> batch size of 2.
>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -- Eric
>>>>>>>
>>>>>>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <jo...@jonhaddad.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> How many CPUs are you using for interrupts?
>>>>>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrup
>>>>>>>> t-handling-in-linux
>>>>>>>>
>>>>>>>> Have you tried making a flame graph to see where Cassandra is
>>>>>>>> spending its time? http://www.brendangregg.
>>>>>>>> com/blog/2014-06-12/java-flame-graphs.html
>>>>>>>>
>>>>>>>> Are you tracking GC pauses?
>>>>>>>>
>>>>>>>> Jon
>>>>>>>>
>>>>>>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <er...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi all:
>>>>>>>>>
>>>>>>>>> I'm new to Cassandra and I'm doing some performance testing.  One
>>>>>>>>> of the things that I'm testing is ingestion throughput.   My server setup is:
>>>>>>>>>
>>>>>>>>>    - 3 node cluster
>>>>>>>>>    - SSD data (both commit log and sstables are on the same disk)
>>>>>>>>>    - 64 GB RAM per server
>>>>>>>>>    - 48 cores per server
>>>>>>>>>    - Cassandra 3.0.11
>>>>>>>>>    - 48 Gb heap using G1GC
>>>>>>>>>    - 1 Gbps NICs
>>>>>>>>>
>>>>>>>>> Since I'm using SSD I've tried tuning the following (one at a
>>>>>>>>> time) but none seemed to make a lot of difference:
>>>>>>>>>
>>>>>>>>>    - concurrent_writes=384
>>>>>>>>>    - memtable_flush_writers=8
>>>>>>>>>    - concurrent_compactors=8
>>>>>>>>>
>>>>>>>>> I am currently doing ingestion tests sending data from 3 clients
>>>>>>>>> on the same subnet.  I am using cassandra-stress to do some ingestion
>>>>>>>>> testing.  The tests are using CL=ONE and RF=2.
>>>>>>>>>
>>>>>>>>> Using cassandra-stress (3.10) I am able to saturate the disk using
>>>>>>>>> a large enough column size and the standard five column cassandra-stress
>>>>>>>>> schema.  For example, -col size=fixed(400) will saturate the disk
>>>>>>>>> and compactions will start falling behind.
>>>>>>>>>
>>>>>>>>> One of our main tables has a row size of approximately 200
>>>>>>>>> bytes, across 64 columns.  When ingesting this table I don't see any
>>>>>>>>> resource saturation.  Disk utilization is around 10-15% per iostat.
>>>>>>>>> Incoming network traffic on the servers is around 100-300 Mbps.  CPU
>>>>>>>>> utilization is around 20-70%.  nodetool tpstats shows mostly
>>>>>>>>> zeros with occasional spikes around 500 in MutationStage.
>>>>>>>>>
>>>>>>>>> The stress run does 10,000,000 inserts per client, each with a
>>>>>>>>> separate range of partition IDs.  The run with 200 byte rows takes about 4
>>>>>>>>> minutes, with mean Latency 4.5ms, Total GC time of 21 secs, Avg GC time 173
>>>>>>>>> ms.
>>>>>>>>>
>>>>>>>>> The overall performance is good - around 120k rows/sec ingested.
>>>>>>>>> But I'm curious to know where the bottleneck is.  There's no resource
>>>>>>>>> saturation and nodetool tpstats shows only occasional brief
>>>>>>>>> queueing.  Is the rest just expected latency inside of Cassandra?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> -- Eric
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>

Re: Bottleneck for small inserts?

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
Did you try adding more client stress nodes as Patrick recommended?

On Tue, Jun 13, 2017 at 9:31 PM Eric Pederson <er...@gmail.com> wrote:

> Scratch that theory - the flamegraphs show that GC is only 3-4% of two
> newer machine's overall processing, compared to 18% on the slow machine.
>
> I took that machine out of the cluster completely and recreated the
> keyspaces.  The ingest tests now run slightly faster (!).   I would have
> expected a linear slowdown since the load is fairly balanced across
> partitions.  GC appears to be the bottleneck in the 3-server
> configuration.  But still in the two-server configuration the
> CPU/disk/network is still not being fully utilized (the closest is CPU at
> ~45% on one ingest test).  nodetool tpstats shows only blips of queueing.
>
>
>
>
>
> -- Eric
>
> On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson <er...@gmail.com> wrote:
>
>> Hi all - I wanted to follow up on this.  I'm happy with the throughput
>> we're getting but I'm still curious about the bottleneck.
>>
>> The big thing that sticks out is one of the nodes is logging frequent
>> GCInspector messages: 350-500ms every 3-6 seconds.  All three nodes in
>> the cluster have identical Cassandra configuration, but the node that is
>> logging frequent GCs is an older machine with slower CPU and SSD.  This
>> node logs frequent GCInspectors both under load and when compacting but
>> otherwise unloaded.
>>
>> My theory is that the other two nodes have similar GC frequency (because
>> they are seeing the same basic load), but because they are faster machines,
>> they don't spend as much time per GC and don't cross the GCInspector
>> threshold.  Does that sound plausible?   nodetool tpstats doesn't show
>> any queueing in the system.
>>
>> Here's flamegraphs from the system when running a cqlsh COPY FROM:
>>
>>    -
>>    http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>    -
>>    http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>    -
>>    http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>
>> The slow node (ultva03) spends disproportional time in GC.
>>
>> Thanks,
>>
>>
>> -- Eric
>>
>> On Thu, May 25, 2017 at 8:09 PM, Eric Pederson <er...@gmail.com> wrote:
>>
>>> Due to a cut and paste error those flamegraphs were a recording of the
>>> whole system, not just Cassandra.    Throughput is approximately 30k
>>> rows/sec.
>>>
>>> Here's the graphs with just the Cassandra PID:
>>>
>>>    -
>>>    http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars2.svg
>>>    -
>>>    http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars2.svg
>>>    -
>>>    http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars2.svg
>>>
>>>
>>> And here's graphs during a cqlsh COPY FROM to the same table, using
>>> real data, MAXBATCHSIZE=2.    Throughput is good at approximately 110k
>>> rows/sec.
>>>
>>>    -
>>>    http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
>>>    -
>>>    http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
>>>    -
>>>    http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg
>>>
>>>
>>>
>>>
>>> -- Eric
>>>
>>> On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <er...@gmail.com>
>>> wrote:
>>>
>>>> Totally understood :)
>>>>
>>>> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to
>>>> include all of the CPUs.  Actually most of them were set that way already
>>>> (for example, 0000ffff,ffffffff) - it might be because irqbalanced is
>>>> running.  But for some reason the interrupts are all being handled on CPU 0
>>>> anyway.
>>>>
>>>> I see this in /var/log/dmesg on the machines:
>>>>
>>>>>
>>>>> Your BIOS has requested that x2apic be disabled.
>>>>> This will leave your machine vulnerable to irq-injection attacks.
>>>>> Use 'intremap=no_x2apic_optout' to override BIOS request.
>>>>> Enabled IRQ remapping in xapic mode
>>>>> x2apic not enabled, IRQ remapping is in xapic mode
>>>>
>>>>
>>>> In a reply to one of the comments, he says:
>>>>
>>>>
>>>> When IO-APIC configured to spread interrupts among all cores, it can
>>>>> handle up to eight cores. If you have more than eight cores, kernel will
>>>>> not configure IO-APIC to spread interrupts. Thus the trick I described in
>>>>> the article will not work.
>>>>> Otherwise it may be caused by buggy BIOS or even buggy hardware.
>>>>
>>>>
>>>> I'm not sure if either of them is relevant to my situation.
>>>>
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -- Eric
>>>>
>>>> On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <jo...@jonhaddad.com>
>>>> wrote:
>>>>
>>>>> You shouldn't need a kernel recompile.  Check out the section "Simple
>>>>> solution for the problem" in
>>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux.
>>>>> You can balance your requests across up to 8 CPUs.
>>>>>
>>>>> I'll check out the flame graphs in a little bit - in the middle of
>>>>> something and my brain doesn't multitask well :)
>>>>>
>>>>> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <er...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Jonathan -
>>>>>>
>>>>>> It looks like these machines are configured to use CPU 0 for all I/O
>>>>>> interrupts.  I don't think I'm going to get the OK to compile a new kernel
>>>>>> for them to balance the interrupts across CPUs, but to mitigate the problem
>>>>>> I taskset the Cassandra process to run on all CPU except 0.  It didn't
>>>>>> change the performance though.  Let me know if you think it's crucial that
>>>>>> we balance the interrupts across CPUs and I can try to lobby for a new
>>>>>> kernel.
>>>>>>
>>>>>> Here are flamegraphs from each node from a cassandra-stress ingest
>>>>>> into a table representative of what we are going to be using.   This
>>>>>> table is also roughly 200 bytes, with 64 columns and the primary key (date,
>>>>>> sequence_number).  Cassandra-stress was run on 3 separate client
>>>>>> machines.  Using cassandra-stress to write to this table I see the
>>>>>> same thing: neither disk, CPU or network is fully utilized.
>>>>>>
>>>>>>    -
>>>>>>    http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars.svg
>>>>>>    -
>>>>>>    http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars.svg
>>>>>>    -
>>>>>>    http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars.svg
>>>>>>
>>>>>> Re: GC: In the stress run with the parameters above, two of the three
>>>>>> nodes log zero or one GCInspectors.  On the other hand, the 3rd
>>>>>> machine logs a GCInspector every 5 seconds or so, 300-500ms each
>>>>>> time.  I found out that the 3rd machine actually has different specs from the
>>>>>> other two.  It's an older box with the same RAM but fewer CPUs (32 instead
>>>>>> of 48), a slower SSD and slower memory.   The Cassandra configuration is
>>>>>> exactly the same.   I tried running Cassandra with only 32 CPUs on the
>>>>>> newer boxes to see if that would cause them to GC pause more, but it didn't.
>>>>>>
>>>>>> On a separate topic - for this cassandra-stress run I reduced the
>>>>>> batch size to 2 in order to keep the logs clean.  That also reduced the
>>>>>> throughput from around 100k rows/second to 32k rows/sec.  I've been doing
>>>>>> ingestion tests using cassandra-stress, cqlsh COPY FROM and a custom
>>>>>> C++ application.  In most of the tests that I've been doing I've been using
>>>>>> a batch size of around 20 (unlogged, all batch rows have the same partition
>>>>>> key).  However, it fills the logs with batch size warnings.  I was going to
>>>>>> raise the batch warning size but the docs scared me away from doing that.
>>>>>> Given that we're using unlogged/same partition batches is it safe to raise
>>>>>> the batch size warning limit?   Actually cqlsh COPY FROM has very
>>>>>> good throughput using a small batch size, but I can't get that same
>>>>>> throughput in cassandra-stress or my C++ app with a batch size of 2.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- Eric
>>>>>>
>>>>>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <jo...@jonhaddad.com>
>>>>>> wrote:
>>>>>>
>>>>>>> How many CPUs are you using for interrupts?
>>>>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux
>>>>>>>
>>>>>>> Have you tried making a flame graph to see where Cassandra is
>>>>>>> spending its time?
>>>>>>> http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>>>>>>>
>>>>>>> Are you tracking GC pauses?
>>>>>>>
>>>>>>> Jon
>>>>>>>
>>>>>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <er...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all:
>>>>>>>>
>>>>>>>> I'm new to Cassandra and I'm doing some performance testing.  One
>>>>>>>> of things that I'm testing is ingestion throughput.   My server setup is:
>>>>>>>>
>>>>>>>>    - 3 node cluster
>>>>>>>>    - SSD data (both commit log and sstables are on the same disk)
>>>>>>>>    - 64 GB RAM per server
>>>>>>>>    - 48 cores per server
>>>>>>>>    - Cassandra 3.0.11
>>>>>>>>    - 48 Gb heap using G1GC
>>>>>>>>    - 1 Gbps NICs
>>>>>>>>
>>>>>>>> Since I'm using SSD I've tried tuning the following (one at a time)
>>>>>>>> but none seemed to make a lot of difference:
>>>>>>>>
>>>>>>>>    - concurrent_writes=384
>>>>>>>>    - memtable_flush_writers=8
>>>>>>>>    - concurrent_compactors=8
>>>>>>>>
>>>>>>>> I am currently doing ingestion tests sending data from 3 clients on
>>>>>>>> the same subnet.  I am using cassandra-stress to do some ingestion
>>>>>>>> testing.  The tests are using CL=ONE and RF=2.
>>>>>>>>
>>>>>>>> Using cassandra-stress (3.10) I am able to saturate the disk using
>>>>>>>> a large enough column size and the standard five column cassandra-stress
>>>>>>>> schema.  For example, -col size=fixed(400) will saturate the disk
>>>>>>>> and compactions will start falling behind.
>>>>>>>>
>>>>>>>> One of our main tables has a row size that approximately 200 bytes,
>>>>>>>> across 64 columns.  When ingesting this table I don't see any resource
>>>>>>>> saturation.  Disk utilization is around 10-15% per iostat.
>>>>>>>> Incoming network traffic on the servers is around 100-300 Mbps.  CPU
>>>>>>>> utilization is around 20-70%.  nodetool tpstats shows mostly zeros
>>>>>>>> with occasional spikes around 500 in MutationStage.
>>>>>>>>
>>>>>>>> The stress run does 10,000,000 inserts per client, each with a
>>>>>>>> separate range of partition IDs.  The run with 200 byte rows takes about 4
>>>>>>>> minutes, with mean Latency 4.5ms, Total GC time of 21 secs, Avg GC time 173
>>>>>>>> ms.
>>>>>>>>
>>>>>>>> The overall performance is good - around 120k rows/sec ingested.
>>>>>>>> But I'm curious to know where the bottleneck is.  There's no resource
>>>>>>>> saturation and nodetool tpstats shows only occasional brief
>>>>>>>> queueing.  Is the rest just expected latency inside of Cassandra?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> -- Eric
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
>

Re: Bottleneck for small inserts?

Posted by Eric Pederson <er...@gmail.com>.
Scratch that theory - the flamegraphs show that GC is only 3-4% of the two
newer machines' overall processing, compared to 18% on the slow machine.

I took that machine out of the cluster completely and recreated the
keyspaces.  The ingest tests now run slightly faster (!).   I would have
expected a roughly linear slowdown, since the load is fairly balanced across
partitions.  GC appears to be the bottleneck in the 3-server configuration.
But even in the two-server configuration, CPU, disk and network are still
not fully utilized (the closest is CPU at ~45% on one ingest test).
nodetool tpstats shows only blips of queueing.
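
In case it's useful, here is a rough sketch of how the GC overhead could be
tallied straight from system.log instead of eyeballing the flamegraphs.  It
assumes GCInspector lines contain "... GC in <N>ms" and a
"YYYY-MM-DD HH:MM:SS" timestamp, which is what our 3.0.11 logs look like -
adjust the regexes if your format differs:

import re
import sys
from datetime import datetime

# Sum the GC pause time logged by GCInspector in a Cassandra system.log and
# report it as a percentage of the wall-clock span covered by those lines.
GC_RE = re.compile(r'GCInspector.*? GC in (\d+)ms')
TS_RE = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})')

def gc_overhead(path):
    total_pause_ms = 0
    first_ts = last_ts = None
    with open(path) as f:
        for line in f:
            m = GC_RE.search(line)
            if not m:
                continue
            total_pause_ms += int(m.group(1))
            ts = TS_RE.search(line)
            if ts:
                t = datetime.strptime(ts.group(1), '%Y-%m-%d %H:%M:%S')
                first_ts = first_ts or t
                last_ts = t
    if not first_ts or first_ts == last_ts:
        return total_pause_ms, None
    wall_ms = (last_ts - first_ts).total_seconds() * 1000
    return total_pause_ms, 100.0 * total_pause_ms / wall_ms

if __name__ == '__main__':
    pause_ms, pct = gc_overhead(sys.argv[1])
    print('logged GC pause: %d ms, overhead: %s'
          % (pause_ms, '%.1f%%' % pct if pct is not None else 'n/a'))

Note that this only counts pauses long enough to be logged by GCInspector,
which is exactly the blind spot on the faster nodes, so it understates their
overhead.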




-- Eric

On Mon, Jun 12, 2017 at 9:50 PM, Eric Pederson <er...@gmail.com> wrote:

> Hi all - I wanted to follow up on this.  I'm happy with the throughput
> we're getting but I'm still curious about the bottleneck.
>
> The big thing that sticks out is one of the nodes is logging frequent
> GCInspector messages: 350-500ms every 3-6 seconds.  All three nodes in
> the cluster have identical Cassandra configuration, but the node that is
> logging frequent GCs is an older machine with slower CPU and SSD.  This
> node logs frequent GCInspectors both under load and when compacting but
> otherwise unloaded.
>
> My theory is that the other two nodes have similar GC frequency (because
> they are seeing the same basic load), but because they are faster machines,
> they don't spend as much time per GC and don't cross the GCInspector
> threshold.  Does that sound plausible?   nodetool tpstats doesn't show
> any queueing in the system.
>
> Here's flamegraphs from the system when running a cqlsh COPY FROM:
>
>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>    05/flamegraph_ultva01_cars_batch2.svg
>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg>
>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>    05/flamegraph_ultva02_cars_batch2.svg
>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg>
>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>    05/flamegraph_ultva03_cars_batch2.svg
>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg>
>
> The slow node (ultva03) spends disproportional time in GC.
>
> Thanks,
>
>
> -- Eric
>
> On Thu, May 25, 2017 at 8:09 PM, Eric Pederson <er...@gmail.com> wrote:
>
>> Due to a cut and paste error those flamegraphs were a recording of the
>> whole system, not just Cassandra.    Throughput is approximately 30k
>> rows/sec.
>>
>> Here's the graphs with just the Cassandra PID:
>>
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>>    05/flamegraph_ultva01_sars2.svg
>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars2.svg>
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>>    05/flamegraph_ultva02_sars2.svg
>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars2.svg>
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>>    05/flamegraph_ultva03_sars2.svg
>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars2.svg>
>>
>>
>> And here's graphs during a cqlsh COPY FROM to the same table, using real
>> data, MAXBATCHSIZE=2.    Throughput is good at approximately 110k
>> rows/sec.
>>
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>>    05/flamegraph_ultva01_cars_batch2.svg
>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg>
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>>    05/flamegraph_ultva02_cars_batch2.svg
>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg>
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>>    05/flamegraph_ultva03_cars_batch2.svg
>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg>
>>
>>
>>
>>
>> -- Eric
>>
>> On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <er...@gmail.com> wrote:
>>
>>> Totally understood :)
>>>
>>> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to
>>> include all of the CPUs.  Actually most of them were set that way already
>>> (for example, 0000ffff,ffffffff) - it might be because irqbalanced is
>>> running.  But for some reason the interrupts are all being handled on CPU 0
>>> anyway.
>>>
>>> I see this in /var/log/dmesg on the machines:
>>>
>>>>
>>>> Your BIOS has requested that x2apic be disabled.
>>>> This will leave your machine vulnerable to irq-injection attacks.
>>>> Use 'intremap=no_x2apic_optout' to override BIOS request.
>>>> Enabled IRQ remapping in xapic mode
>>>> x2apic not enabled, IRQ remapping is in xapic mode
>>>
>>>
>>> In a reply to one of the comments, he says:
>>>
>>>
>>> When IO-APIC configured to spread interrupts among all cores, it can
>>>> handle up to eight cores. If you have more than eight cores, kernel will
>>>> not configure IO-APIC to spread interrupts. Thus the trick I described in
>>>> the article will not work.
>>>> Otherwise it may be caused by buggy BIOS or even buggy hardware.
>>>
>>>
>>> I'm not sure if either of them is relevant to my situation.
>>>
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>>>
>>> -- Eric
>>>
>>> On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <jo...@jonhaddad.com>
>>> wrote:
>>>
>>>> You shouldn't need a kernel recompile.  Check out the section "Simple
>>>> solution for the problem" in http://www.alexonlinux.com/
>>>> smp-affinity-and-proper-interrupt-handling-in-linux.  You can balance
>>>> your requests across up to 8 CPUs.
>>>>
>>>> I'll check out the flame graphs in a little bit - in the middle of
>>>> something and my brain doesn't multitask well :)
>>>>
>>>> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <er...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Jonathan -
>>>>>
>>>>> It looks like these machines are configured to use CPU 0 for all I/O
>>>>> interrupts.  I don't think I'm going to get the OK to compile a new kernel
>>>>> for them to balance the interrupts across CPUs, but to mitigate the problem
>>>>> I taskset the Cassandra process to run on all CPU except 0.  It didn't
>>>>> change the performance though.  Let me know if you think it's crucial that
>>>>> we balance the interrupts across CPUs and I can try to lobby for a new
>>>>> kernel.
>>>>>
>>>>> Here are flamegraphs from each node from a cassandra-stress ingest
>>>>> into a table representative of the what we are going to be using.   This
>>>>> table is also roughly 200 bytes, with 64 columns and the primary key (date,
>>>>> sequence_number).  Cassandra-stress was run on 3 separate client
>>>>> machines.  Using cassandra-stress to write to this table I see the
>>>>> same thing: neither disk, CPU or network is fully utilized.
>>>>>
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>    /flamegraph_ultva01_sars.svg
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>    /flamegraph_ultva02_sars.svg
>>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>>    /flamegraph_ultva03_sars.svg
>>>>>
>>>>> Re: GC: In the stress run with the parameters above, two of the three
>>>>> nodes log zero or one GCInspectors.  On the other hand, the 3rd
>>>>> machine logs a GCInspector every 5 seconds or so, 300-500ms each
>>>>> time.  I found out that the 3rd machine actually has different specs as the
>>>>> other two.  It's an older box with the same RAM but less CPUs (32 instead
>>>>> of 48), a slower SSD and slower memory.   The Cassandra configuration is
>>>>> exactly the same.   I tried running Cassandra with only 32 CPUs on the
>>>>> newer boxes to see if that would cause them to GC pause more, but it didn't.
>>>>>
>>>>> On a separate topic - for this cassandra-stress run I reduced the
>>>>> batch size to 2 in order to keep the logs clean.  That also reduced the
>>>>> throughput from around 100k rows/second to 32k rows/sec.  I've been doing
>>>>> ingestion tests using cassandra-stress, cqlsh COPY FROM and a custom
>>>>> C++ application.  In most of the tests that I've been doing I've been using
>>>>> a batch size of around 20 (unlogged, all batch rows have the same partition
>>>>> key).  However, it fills the logs with batch size warnings.  I was going to
>>>>> raise the batch warning size but the docs scared me away from doing that.
>>>>> Given that we're using unlogged/same partition batches is it safe to raise
>>>>> the batch size warning limit?   Actually cqlsh COPY FROM has very
>>>>> good throughput using a small batch size, but I can't get that same
>>>>> throughput in cassandra-stress or my C++ app with a batch size of 2.
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>> -- Eric
>>>>>
>>>>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <jo...@jonhaddad.com>
>>>>> wrote:
>>>>>
>>>>>> How many CPUs are you using for interrupts?
>>>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrup
>>>>>> t-handling-in-linux
>>>>>>
>>>>>> Have you tried making a flame graph to see where Cassandra is
>>>>>> spending its time? http://www.brendangregg.
>>>>>> com/blog/2014-06-12/java-flame-graphs.html
>>>>>>
>>>>>> Are you tracking GC pauses?
>>>>>>
>>>>>> Jon
>>>>>>
>>>>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <er...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all:
>>>>>>>
>>>>>>> I'm new to Cassandra and I'm doing some performance testing.  One of
>>>>>>> things that I'm testing is ingestion throughput.   My server setup is:
>>>>>>>
>>>>>>>    - 3 node cluster
>>>>>>>    - SSD data (both commit log and sstables are on the same disk)
>>>>>>>    - 64 GB RAM per server
>>>>>>>    - 48 cores per server
>>>>>>>    - Cassandra 3.0.11
>>>>>>>    - 48 Gb heap using G1GC
>>>>>>>    - 1 Gbps NICs
>>>>>>>
>>>>>>> Since I'm using SSD I've tried tuning the following (one at a time)
>>>>>>> but none seemed to make a lot of difference:
>>>>>>>
>>>>>>>    - concurrent_writes=384
>>>>>>>    - memtable_flush_writers=8
>>>>>>>    - concurrent_compactors=8
>>>>>>>
>>>>>>> I am currently doing ingestion tests sending data from 3 clients on
>>>>>>> the same subnet.  I am using cassandra-stress to do some ingestion
>>>>>>> testing.  The tests are using CL=ONE and RF=2.
>>>>>>>
>>>>>>> Using cassandra-stress (3.10) I am able to saturate the disk using a
>>>>>>> large enough column size and the standard five column cassandra-stress
>>>>>>> schema.  For example, -col size=fixed(400) will saturate the disk
>>>>>>> and compactions will start falling behind.
>>>>>>>
>>>>>>> One of our main tables has a row size that approximately 200 bytes,
>>>>>>> across 64 columns.  When ingesting this table I don't see any resource
>>>>>>> saturation.  Disk utilization is around 10-15% per iostat.
>>>>>>> Incoming network traffic on the servers is around 100-300 Mbps.  CPU
>>>>>>> utilization is around 20-70%.  nodetool tpstats shows mostly zeros
>>>>>>> with occasional spikes around 500 in MutationStage.
>>>>>>>
>>>>>>> The stress run does 10,000,000 inserts per client, each with a
>>>>>>> separate range of partition IDs.  The run with 200 byte rows takes about 4
>>>>>>> minutes, with mean Latency 4.5ms, Total GC time of 21 secs, Avg GC time 173
>>>>>>> ms.
>>>>>>>
>>>>>>> The overall performance is good - around 120k rows/sec ingested.
>>>>>>> But I'm curious to know where the bottleneck is.  There's no resource
>>>>>>> saturation and nodetool tpstats shows only occasional brief
>>>>>>> queueing.  Is the rest just expected latency inside of Cassandra?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> -- Eric
>>>>>>>
>>>>>>
>>>>>
>>>
>>
>

Re: Bottleneck for small inserts?

Posted by Eric Pederson <er...@gmail.com>.
Hi all - I wanted to follow up on this.  I'm happy with the throughput
we're getting but I'm still curious about the bottleneck.

The big thing that sticks out is that one of the nodes is logging frequent
GCInspector messages: 350-500ms every 3-6 seconds.  All three nodes in the
cluster have identical Cassandra configuration, but the node that is
logging frequent GCs is an older machine with a slower CPU and SSD.  This
node logs frequent GCInspector messages both under load and when compacting
while otherwise unloaded.

My theory is that the other two nodes have similar GC frequency (because
they are seeing the same basic load), but because they are faster machines,
they don't spend as much time per GC and don't cross the GCInspector
threshold.  Does that sound plausible?   nodetool tpstats doesn't show any
queueing in the system.
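
A quick back-of-the-envelope check of that theory, with made-up numbers for
the faster nodes and an assumed ~200ms GCInspector logging threshold (the
only measured data point is the 350-500ms every 3-6 seconds on the slow
node):

# Hypothetical numbers: all nodes collect roughly the same garbage per unit
# time, so GC frequency is similar, but pause duration scales with hardware.
interval_s = 5.0          # assumed: one GC roughly every 5 seconds
log_threshold_ms = 200    # assumed GCInspector INFO logging threshold

for node, pause_ms in [('slow node', 400), ('fast node', 150)]:
    overhead_pct = pause_ms / (interval_s * 1000) * 100
    logged = pause_ms > log_threshold_ms
    print('%-9s %dms pause -> %.1f%% GC overhead, logged by GCInspector: %s'
          % (node, pause_ms, overhead_pct, logged))

So the newer nodes could be paying a few percent in GC without ever showing
up in the logs, which would match the flamegraphs.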

Here are flamegraphs from the system when running a cqlsh COPY FROM:

   - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
   - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
   - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg

The slow node (ultva03) spends a disproportionate amount of time in GC.

Thanks,


-- Eric

On Thu, May 25, 2017 at 8:09 PM, Eric Pederson <er...@gmail.com> wrote:

> Due to a cut and paste error those flamegraphs were a recording of the
> whole system, not just Cassandra.    Throughput is approximately 30k
> rows/sec.
>
> Here's the graphs with just the Cassandra PID:
>
>    - http://sourcedelica.com/wordpress/wp-content/uploads/
>    2017/05/flamegraph_ultva01_sars2.svg
>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars2.svg>
>    - http://sourcedelica.com/wordpress/wp-content/uploads/
>    2017/05/flamegraph_ultva02_sars2.svg
>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars2.svg>
>    - http://sourcedelica.com/wordpress/wp-content/uploads/
>    2017/05/flamegraph_ultva03_sars2.svg
>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars2.svg>
>
>
> And here's graphs during a cqlsh COPY FROM to the same table, using real
> data, MAXBATCHSIZE=2.    Throughput is good at approximately 110k
> rows/sec.
>
>    - http://sourcedelica.com/wordpress/wp-content/uploads/
>    2017/05/flamegraph_ultva01_cars_batch2.svg
>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg>
>    - http://sourcedelica.com/wordpress/wp-content/uploads/
>    2017/05/flamegraph_ultva02_cars_batch2.svg
>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg>
>    - http://sourcedelica.com/wordpress/wp-content/uploads/
>    2017/05/flamegraph_ultva03_cars_batch2.svg
>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg>
>
>
>
>
> -- Eric
>
> On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <er...@gmail.com> wrote:
>
>> Totally understood :)
>>
>> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to include
>> all of the CPUs.  Actually most of them were set that way already (for
>> example, 0000ffff,ffffffff) - it might be because irqbalanced is
>> running.  But for some reason the interrupts are all being handled on CPU 0
>> anyway.
>>
>> I see this in /var/log/dmesg on the machines:
>>
>>>
>>> Your BIOS has requested that x2apic be disabled.
>>> This will leave your machine vulnerable to irq-injection attacks.
>>> Use 'intremap=no_x2apic_optout' to override BIOS request.
>>> Enabled IRQ remapping in xapic mode
>>> x2apic not enabled, IRQ remapping is in xapic mode
>>
>>
>> In a reply to one of the comments, he says:
>>
>>
>> When IO-APIC configured to spread interrupts among all cores, it can
>>> handle up to eight cores. If you have more than eight cores, kernel will
>>> not configure IO-APIC to spread interrupts. Thus the trick I described in
>>> the article will not work.
>>> Otherwise it may be caused by buggy BIOS or even buggy hardware.
>>
>>
>> I'm not sure if either of them is relevant to my situation.
>>
>>
>> Thanks!
>>
>>
>>
>>
>>
>> -- Eric
>>
>> On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <jo...@jonhaddad.com>
>> wrote:
>>
>>> You shouldn't need a kernel recompile.  Check out the section "Simple
>>> solution for the problem" in http://www.alexonlinux.com/
>>> smp-affinity-and-proper-interrupt-handling-in-linux.  You can balance
>>> your requests across up to 8 CPUs.
>>>
>>> I'll check out the flame graphs in a little bit - in the middle of
>>> something and my brain doesn't multitask well :)
>>>
>>> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <er...@gmail.com> wrote:
>>>
>>>> Hi Jonathan -
>>>>
>>>> It looks like these machines are configured to use CPU 0 for all I/O
>>>> interrupts.  I don't think I'm going to get the OK to compile a new kernel
>>>> for them to balance the interrupts across CPUs, but to mitigate the problem
>>>> I taskset the Cassandra process to run on all CPU except 0.  It didn't
>>>> change the performance though.  Let me know if you think it's crucial that
>>>> we balance the interrupts across CPUs and I can try to lobby for a new
>>>> kernel.
>>>>
>>>> Here are flamegraphs from each node from a cassandra-stress ingest
>>>> into a table representative of the what we are going to be using.   This
>>>> table is also roughly 200 bytes, with 64 columns and the primary key (date,
>>>> sequence_number).  Cassandra-stress was run on 3 separate client
>>>> machines.  Using cassandra-stress to write to this table I see the
>>>> same thing: neither disk, CPU or network is fully utilized.
>>>>
>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>    /flamegraph_ultva01_sars.svg
>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>    /flamegraph_ultva02_sars.svg
>>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05
>>>>    /flamegraph_ultva03_sars.svg
>>>>
>>>> Re: GC: In the stress run with the parameters above, two of the three
>>>> nodes log zero or one GCInspectors.  On the other hand, the 3rd
>>>> machine logs a GCInspector every 5 seconds or so, 300-500ms each
>>>> time.  I found out that the 3rd machine actually has different specs as the
>>>> other two.  It's an older box with the same RAM but less CPUs (32 instead
>>>> of 48), a slower SSD and slower memory.   The Cassandra configuration is
>>>> exactly the same.   I tried running Cassandra with only 32 CPUs on the
>>>> newer boxes to see if that would cause them to GC pause more, but it didn't.
>>>>
>>>> On a separate topic - for this cassandra-stress run I reduced the
>>>> batch size to 2 in order to keep the logs clean.  That also reduced the
>>>> throughput from around 100k rows/second to 32k rows/sec.  I've been doing
>>>> ingestion tests using cassandra-stress, cqlsh COPY FROM and a custom
>>>> C++ application.  In most of the tests that I've been doing I've been using
>>>> a batch size of around 20 (unlogged, all batch rows have the same partition
>>>> key).  However, it fills the logs with batch size warnings.  I was going to
>>>> raise the batch warning size but the docs scared me away from doing that.
>>>> Given that we're using unlogged/same partition batches is it safe to raise
>>>> the batch size warning limit?   Actually cqlsh COPY FROM has very good
>>>> throughput using a small batch size, but I can't get that same throughput
>>>> in cassandra-stress or my C++ app with a batch size of 2.
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>> -- Eric
>>>>
>>>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <jo...@jonhaddad.com>
>>>> wrote:
>>>>
>>>>> How many CPUs are you using for interrupts?
>>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrup
>>>>> t-handling-in-linux
>>>>>
>>>>> Have you tried making a flame graph to see where Cassandra is spending
>>>>> its time? http://www.brendangregg.com/blog/2014-06-12/java-flame
>>>>> -graphs.html
>>>>>
>>>>> Are you tracking GC pauses?
>>>>>
>>>>> Jon
>>>>>
>>>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <er...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi all:
>>>>>>
>>>>>> I'm new to Cassandra and I'm doing some performance testing.  One of
>>>>>> things that I'm testing is ingestion throughput.   My server setup is:
>>>>>>
>>>>>>    - 3 node cluster
>>>>>>    - SSD data (both commit log and sstables are on the same disk)
>>>>>>    - 64 GB RAM per server
>>>>>>    - 48 cores per server
>>>>>>    - Cassandra 3.0.11
>>>>>>    - 48 Gb heap using G1GC
>>>>>>    - 1 Gbps NICs
>>>>>>
>>>>>> Since I'm using SSD I've tried tuning the following (one at a time)
>>>>>> but none seemed to make a lot of difference:
>>>>>>
>>>>>>    - concurrent_writes=384
>>>>>>    - memtable_flush_writers=8
>>>>>>    - concurrent_compactors=8
>>>>>>
>>>>>> I am currently doing ingestion tests sending data from 3 clients on
>>>>>> the same subnet.  I am using cassandra-stress to do some ingestion
>>>>>> testing.  The tests are using CL=ONE and RF=2.
>>>>>>
>>>>>> Using cassandra-stress (3.10) I am able to saturate the disk using a
>>>>>> large enough column size and the standard five column cassandra-stress
>>>>>> schema.  For example, -col size=fixed(400) will saturate the disk
>>>>>> and compactions will start falling behind.
>>>>>>
>>>>>> One of our main tables has a row size that approximately 200 bytes,
>>>>>> across 64 columns.  When ingesting this table I don't see any resource
>>>>>> saturation.  Disk utilization is around 10-15% per iostat.  Incoming
>>>>>> network traffic on the servers is around 100-300 Mbps.  CPU utilization is
>>>>>> around 20-70%.  nodetool tpstats shows mostly zeros with occasional
>>>>>> spikes around 500 in MutationStage.
>>>>>>
>>>>>> The stress run does 10,000,000 inserts per client, each with a
>>>>>> separate range of partition IDs.  The run with 200 byte rows takes about 4
>>>>>> minutes, with mean Latency 4.5ms, Total GC time of 21 secs, Avg GC time 173
>>>>>> ms.
>>>>>>
>>>>>> The overall performance is good - around 120k rows/sec ingested.  But
>>>>>> I'm curious to know where the bottleneck is.  There's no resource
>>>>>> saturation and nodetool tpstats shows only occasional brief
>>>>>> queueing.  Is the rest just expected latency inside of Cassandra?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -- Eric
>>>>>>
>>>>>
>>>>
>>
>

Re: Bottleneck for small inserts?

Posted by Eric Pederson <er...@gmail.com>.
Due to a cut-and-paste error, those flamegraphs were a recording of the
whole system, not just Cassandra.    Throughput is approximately 30k
rows/sec.

Here are the graphs for just the Cassandra PID:

   - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars2.svg
   - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars2.svg
   - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars2.svg


And here are graphs during a cqlsh COPY FROM to the same table, using real
data and MAXBATCHSIZE=2.    Throughput is good at approximately 110k rows/sec.

   - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_cars_batch2.svg
   - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_cars_batch2.svg
   - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_cars_batch2.svg




-- Eric

On Thu, May 25, 2017 at 6:44 PM, Eric Pederson <er...@gmail.com> wrote:

> Totally understood :)
>
> I forgot to mention - I set the /proc/irq/*/smp_affinity mask to include
> all of the CPUs.  Actually most of them were set that way already (for
> example, 0000ffff,ffffffff) - it might be because irqbalanced is
> running.  But for some reason the interrupts are all being handled on CPU 0
> anyway.
>
> I see this in /var/log/dmesg on the machines:
>
>>
>> Your BIOS has requested that x2apic be disabled.
>> This will leave your machine vulnerable to irq-injection attacks.
>> Use 'intremap=no_x2apic_optout' to override BIOS request.
>> Enabled IRQ remapping in xapic mode
>> x2apic not enabled, IRQ remapping is in xapic mode
>
>
> In a reply to one of the comments, he says:
>
>
> When IO-APIC configured to spread interrupts among all cores, it can
>> handle up to eight cores. If you have more than eight cores, kernel will
>> not configure IO-APIC to spread interrupts. Thus the trick I described in
>> the article will not work.
>> Otherwise it may be caused by buggy BIOS or even buggy hardware.
>
>
> I'm not sure if either of them is relevant to my situation.
>
>
> Thanks!
>
>
>
>
>
> -- Eric
>
> On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <jo...@jonhaddad.com>
> wrote:
>
>> You shouldn't need a kernel recompile.  Check out the section "Simple
>> solution for the problem" in http://www.alexonlinux.com/
>> smp-affinity-and-proper-interrupt-handling-in-linux.  You can balance
>> your requests across up to 8 CPUs.
>>
>> I'll check out the flame graphs in a little bit - in the middle of
>> something and my brain doesn't multitask well :)
>>
>> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <er...@gmail.com> wrote:
>>
>>> Hi Jonathan -
>>>
>>> It looks like these machines are configured to use CPU 0 for all I/O
>>> interrupts.  I don't think I'm going to get the OK to compile a new kernel
>>> for them to balance the interrupts across CPUs, but to mitigate the problem
>>> I taskset the Cassandra process to run on all CPU except 0.  It didn't
>>> change the performance though.  Let me know if you think it's crucial that
>>> we balance the interrupts across CPUs and I can try to lobby for a new
>>> kernel.
>>>
>>> Here are flamegraphs from each node from a cassandra-stress ingest into
>>> a table representative of the what we are going to be using.   This table
>>> is also roughly 200 bytes, with 64 columns and the primary key (date,
>>> sequence_number).  Cassandra-stress was run on 3 separate client
>>> machines.  Using cassandra-stress to write to this table I see the same
>>> thing: neither disk, CPU or network is fully utilized.
>>>
>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>>>    05/flamegraph_ultva01_sars.svg
>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>>>    05/flamegraph_ultva02_sars.svg
>>>    - http://sourcedelica.com/wordpress/wp-content/uploads/2017/
>>>    05/flamegraph_ultva03_sars.svg
>>>
>>> Re: GC: In the stress run with the parameters above, two of the three
>>> nodes log zero or one GCInspectors.  On the other hand, the 3rd machine
>>> logs a GCInspector every 5 seconds or so, 300-500ms each time.  I found
>>> out that the 3rd machine actually has different specs as the other two.
>>> It's an older box with the same RAM but less CPUs (32 instead of 48), a
>>> slower SSD and slower memory.   The Cassandra configuration is exactly the
>>> same.   I tried running Cassandra with only 32 CPUs on the newer boxes to
>>> see if that would cause them to GC pause more, but it didn't.
>>>
>>> On a separate topic - for this cassandra-stress run I reduced the batch
>>> size to 2 in order to keep the logs clean.  That also reduced the
>>> throughput from around 100k rows/second to 32k rows/sec.  I've been doing
>>> ingestion tests using cassandra-stress, cqlsh COPY FROM and a custom
>>> C++ application.  In most of the tests that I've been doing I've been using
>>> a batch size of around 20 (unlogged, all batch rows have the same partition
>>> key).  However, it fills the logs with batch size warnings.  I was going to
>>> raise the batch warning size but the docs scared me away from doing that.
>>> Given that we're using unlogged/same partition batches is it safe to raise
>>> the batch size warning limit?   Actually cqlsh COPY FROM has very good
>>> throughput using a small batch size, but I can't get that same throughput
>>> in cassandra-stress or my C++ app with a batch size of 2.
>>>
>>> Thanks!
>>>
>>>
>>>
>>> -- Eric
>>>
>>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <jo...@jonhaddad.com>
>>> wrote:
>>>
>>>> How many CPUs are you using for interrupts?
>>>> http://www.alexonlinux.com/smp-affinity-and-proper-interrup
>>>> t-handling-in-linux
>>>>
>>>> Have you tried making a flame graph to see where Cassandra is spending
>>>> its time? http://www.brendangregg.com/blog/2014-06-12/java-flame
>>>> -graphs.html
>>>>
>>>> Are you tracking GC pauses?
>>>>
>>>> Jon
>>>>
>>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <er...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi all:
>>>>>
>>>>> I'm new to Cassandra and I'm doing some performance testing.  One of
>>>>> things that I'm testing is ingestion throughput.   My server setup is:
>>>>>
>>>>>    - 3 node cluster
>>>>>    - SSD data (both commit log and sstables are on the same disk)
>>>>>    - 64 GB RAM per server
>>>>>    - 48 cores per server
>>>>>    - Cassandra 3.0.11
>>>>>    - 48 Gb heap using G1GC
>>>>>    - 1 Gbps NICs
>>>>>
>>>>> Since I'm using SSD I've tried tuning the following (one at a time)
>>>>> but none seemed to make a lot of difference:
>>>>>
>>>>>    - concurrent_writes=384
>>>>>    - memtable_flush_writers=8
>>>>>    - concurrent_compactors=8
>>>>>
>>>>> I am currently doing ingestion tests sending data from 3 clients on
>>>>> the same subnet.  I am using cassandra-stress to do some ingestion
>>>>> testing.  The tests are using CL=ONE and RF=2.
>>>>>
>>>>> Using cassandra-stress (3.10) I am able to saturate the disk using a
>>>>> large enough column size and the standard five column cassandra-stress
>>>>> schema.  For example, -col size=fixed(400) will saturate the disk and
>>>>> compactions will start falling behind.
>>>>>
>>>>> One of our main tables has a row size that approximately 200 bytes,
>>>>> across 64 columns.  When ingesting this table I don't see any resource
>>>>> saturation.  Disk utilization is around 10-15% per iostat.  Incoming
>>>>> network traffic on the servers is around 100-300 Mbps.  CPU utilization is
>>>>> around 20-70%.  nodetool tpstats shows mostly zeros with occasional
>>>>> spikes around 500 in MutationStage.
>>>>>
>>>>> The stress run does 10,000,000 inserts per client, each with a
>>>>> separate range of partition IDs.  The run with 200 byte rows takes about 4
>>>>> minutes, with mean Latency 4.5ms, Total GC time of 21 secs, Avg GC time 173
>>>>> ms.
>>>>>
>>>>> The overall performance is good - around 120k rows/sec ingested.  But
>>>>> I'm curious to know where the bottleneck is.  There's no resource
>>>>> saturation and nodetool tpstats shows only occasional brief
>>>>> queueing.  Is the rest just expected latency inside of Cassandra?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -- Eric
>>>>>
>>>>
>>>
>

Re: Bottleneck for small inserts?

Posted by Eric Pederson <er...@gmail.com>.
Totally understood :)

I forgot to mention - I set the /proc/irq/*/smp_affinity mask to include
all of the CPUs.  Actually most of them were set that way already (for
example, 0000ffff,ffffffff) - it might be because irqbalance is running.
But for some reason the interrupts are all being handled on CPU 0 anyway.
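
For reference, a quick read-only Python sketch of how to see which CPU is
actually servicing the NIC/disk interrupts (the device-name pattern is just
an example - adjust it for your hardware):

import re
import sys

# Print per-CPU interrupt counts for IRQs whose name matches a pattern
# (e.g. the NIC or disk controller).  Read-only; run on each node.
pattern = sys.argv[1] if len(sys.argv) > 1 else 'eth|em|nvme|ahci|mpt'

with open('/proc/interrupts') as f:
    cpus = f.readline().split()              # header row: CPU0 CPU1 ...
    for line in f:
        fields = line.split()
        if not fields or not fields[0].rstrip(':').isdigit():
            continue                          # skip NMI/LOC/etc. summary rows
        irq = fields[0].rstrip(':')
        counts = fields[1:1 + len(cpus)]
        name = ' '.join(fields[1 + len(cpus):])
        if re.search(pattern, name):
            busiest = max(range(len(counts)), key=lambda i: int(counts[i]))
            print('IRQ %s (%s): busiest CPU is %s with %s interrupts'
                  % (irq, name, cpus[busiest], counts[busiest]))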

I see this in /var/log/dmesg on the machines:

>
> Your BIOS has requested that x2apic be disabled.
> This will leave your machine vulnerable to irq-injection attacks.
> Use 'intremap=no_x2apic_optout' to override BIOS request.
> Enabled IRQ remapping in xapic mode
> x2apic not enabled, IRQ remapping is in xapic mode


In a reply to one of the comments on that article, the author says:


When IO-APIC configured to spread interrupts among all cores, it can handle
> up to eight cores. If you have more than eight cores, kernel will not
> configure IO-APIC to spread interrupts. Thus the trick I described in the
> article will not work.
> Otherwise it may be caused by buggy BIOS or even buggy hardware.


I'm not sure if either of them is relevant to my situation.


Thanks!





-- Eric

On Thu, May 25, 2017 at 4:16 PM, Jonathan Haddad <jo...@jonhaddad.com> wrote:

> You shouldn't need a kernel recompile.  Check out the section "Simple
> solution for the problem" in http://www.alexonlinux.com/
> smp-affinity-and-proper-interrupt-handling-in-linux.  You can balance
> your requests across up to 8 CPUs.
>
> I'll check out the flame graphs in a little bit - in the middle of
> something and my brain doesn't multitask well :)
>
> On Thu, May 25, 2017 at 1:06 PM Eric Pederson <er...@gmail.com> wrote:
>
>> Hi Jonathan -
>>
>> It looks like these machines are configured to use CPU 0 for all I/O
>> interrupts.  I don't think I'm going to get the OK to compile a new kernel
>> for them to balance the interrupts across CPUs, but to mitigate the problem
>> I taskset the Cassandra process to run on all CPU except 0.  It didn't
>> change the performance though.  Let me know if you think it's crucial that
>> we balance the interrupts across CPUs and I can try to lobby for a new
>> kernel.
>>
>> Here are flamegraphs from each node from a cassandra-stress ingest into
>> a table representative of the what we are going to be using.   This table
>> is also roughly 200 bytes, with 64 columns and the primary key (date,
>> sequence_number).  Cassandra-stress was run on 3 separate client
>> machines.  Using cassandra-stress to write to this table I see the same
>> thing: neither disk, CPU or network is fully utilized.
>>
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/
>>    2017/05/flamegraph_ultva01_sars.svg
>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars.svg>
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/
>>    2017/05/flamegraph_ultva02_sars.svg
>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars.svg>
>>    - http://sourcedelica.com/wordpress/wp-content/uploads/
>>    2017/05/flamegraph_ultva03_sars.svg
>>    <http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars.svg>
>>
>> Re: GC: In the stress run with the parameters above, two of the three
>> nodes log zero or one GCInspectors.  On the other hand, the 3rd machine
>> logs a GCInspector every 5 seconds or so, 300-500ms each time.  I found
>> out that the 3rd machine actually has different specs as the other two.
>> It's an older box with the same RAM but less CPUs (32 instead of 48), a
>> slower SSD and slower memory.   The Cassandra configuration is exactly the
>> same.   I tried running Cassandra with only 32 CPUs on the newer boxes to
>> see if that would cause them to GC pause more, but it didn't.
>>
>> On a separate topic - for this cassandra-stress run I reduced the batch
>> size to 2 in order to keep the logs clean.  That also reduced the
>> throughput from around 100k rows/second to 32k rows/sec.  I've been doing
>> ingestion tests using cassandra-stress, cqlsh COPY FROM and a custom C++
>> application.  In most of the tests that I've been doing I've been using a
>> batch size of around 20 (unlogged, all batch rows have the same partition
>> key).  However, it fills the logs with batch size warnings.  I was going to
>> raise the batch warning size but the docs scared me away from doing that.
>> Given that we're using unlogged/same partition batches is it safe to raise
>> the batch size warning limit?   Actually cqlsh COPY FROM has very good
>> throughput using a small batch size, but I can't get that same throughput
>> in cassandra-stress or my C++ app with a batch size of 2.
>>
>> Thanks!
>>
>>
>>
>> -- Eric
>>
>> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <jo...@jonhaddad.com>
>> wrote:
>>
>>> How many CPUs are you using for interrupts?  http://www.alexonlinux.com/
>>> smp-affinity-and-proper-interrupt-handling-in-linux
>>>
>>> Have you tried making a flame graph to see where Cassandra is spending
>>> its time? http://www.brendangregg.com/blog/2014-06-12/java-
>>> flame-graphs.html
>>>
>>> Are you tracking GC pauses?
>>>
>>> Jon
>>>
>>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <er...@gmail.com> wrote:
>>>
>>>> Hi all:
>>>>
>>>> I'm new to Cassandra and I'm doing some performance testing.  One of
>>>> things that I'm testing is ingestion throughput.   My server setup is:
>>>>
>>>>    - 3 node cluster
>>>>    - SSD data (both commit log and sstables are on the same disk)
>>>>    - 64 GB RAM per server
>>>>    - 48 cores per server
>>>>    - Cassandra 3.0.11
>>>>    - 48 Gb heap using G1GC
>>>>    - 1 Gbps NICs
>>>>
>>>> Since I'm using SSD I've tried tuning the following (one at a time) but
>>>> none seemed to make a lot of difference:
>>>>
>>>>    - concurrent_writes=384
>>>>    - memtable_flush_writers=8
>>>>    - concurrent_compactors=8
>>>>
>>>> I am currently doing ingestion tests sending data from 3 clients on the
>>>> same subnet.  I am using cassandra-stress to do some ingestion testing.
>>>> The tests are using CL=ONE and RF=2.
>>>>
>>>> Using cassandra-stress (3.10) I am able to saturate the disk using a
>>>> large enough column size and the standard five column cassandra-stress
>>>> schema.  For example, -col size=fixed(400) will saturate the disk and
>>>> compactions will start falling behind.
>>>>
>>>> One of our main tables has a row size that approximately 200 bytes,
>>>> across 64 columns.  When ingesting this table I don't see any resource
>>>> saturation.  Disk utilization is around 10-15% per iostat.  Incoming
>>>> network traffic on the servers is around 100-300 Mbps.  CPU utilization is
>>>> around 20-70%.  nodetool tpstats shows mostly zeros with occasional
>>>> spikes around 500 in MutationStage.
>>>>
>>>> The stress run does 10,000,000 inserts per client, each with a separate
>>>> range of partition IDs.  The run with 200 byte rows takes about 4 minutes,
>>>> with mean Latency 4.5ms, Total GC time of 21 secs, Avg GC time 173 ms.
>>>>
>>>> The overall performance is good - around 120k rows/sec ingested.  But
>>>> I'm curious to know where the bottleneck is.  There's no resource
>>>> saturation and nodetool tpstats shows only occasional brief queueing.
>>>> Is the rest just expected latency inside of Cassandra?
>>>>
>>>> Thanks,
>>>>
>>>> -- Eric
>>>>
>>>
>>

Re: Bottleneck for small inserts?

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
You shouldn't need a kernel recompile.  Check out the section "Simple
solution for the problem" in
http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux.
You can balance your requests across up to 8 CPUs.
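
If you end up scripting it, the gist is just writing one-hot hex masks to
/proc/irq/<n>/smp_affinity (IRQ numbers below are made up - grab your NIC's
from /proc/interrupts; needs root):

# Spread a set of IRQs across CPUs 1-8 by writing one-hot hex masks to
# /proc/irq/<n>/smp_affinity.  The IRQ numbers below are placeholders.
irqs = [72, 73, 74, 75]              # e.g. the NIC's TxRx queue IRQs
for i, irq in enumerate(irqs):
    cpu = 1 + (i % 8)                # skip CPU 0, rotate through CPUs 1-8
    with open('/proc/irq/%d/smp_affinity' % irq, 'w') as f:
        f.write('%x' % (1 << cpu))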

I'll check out the flame graphs in a little bit - in the middle of
something and my brain doesn't multitask well :)

On Thu, May 25, 2017 at 1:06 PM Eric Pederson <er...@gmail.com> wrote:

> Hi Jonathan -
>
> It looks like these machines are configured to use CPU 0 for all I/O
> interrupts.  I don't think I'm going to get the OK to compile a new kernel
> for them to balance the interrupts across CPUs, but to mitigate the problem
> I taskset the Cassandra process to run on all CPU except 0.  It didn't
> change the performance though.  Let me know if you think it's crucial that
> we balance the interrupts across CPUs and I can try to lobby for a new
> kernel.
>
> Here are flamegraphs from each node from a cassandra-stress ingest into a
> table representative of the what we are going to be using.   This table is
> also roughly 200 bytes, with 64 columns and the primary key (date,
> sequence_number).  Cassandra-stress was run on 3 separate client
> machines.  Using cassandra-stress to write to this table I see the same
> thing: neither disk, CPU or network is fully utilized.
>
>    -
>    http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars.svg
>    -
>    http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars.svg
>    -
>    http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars.svg
>
> Re: GC: In the stress run with the parameters above, two of the three
> nodes log zero or one GCInspectors.  On the other hand, the 3rd machine
> logs a GCInspector every 5 seconds or so, 300-500ms each time.  I found
> out that the 3rd machine actually has different specs as the other two.
> It's an older box with the same RAM but less CPUs (32 instead of 48), a
> slower SSD and slower memory.   The Cassandra configuration is exactly the
> same.   I tried running Cassandra with only 32 CPUs on the newer boxes to
> see if that would cause them to GC pause more, but it didn't.
>
> On a separate topic - for this cassandra-stress run I reduced the batch
> size to 2 in order to keep the logs clean.  That also reduced the
> throughput from around 100k rows/second to 32k rows/sec.  I've been doing
> ingestion tests using cassandra-stress, cqlsh COPY FROM and a custom C++
> application.  In most of the tests that I've been doing I've been using a
> batch size of around 20 (unlogged, all batch rows have the same partition
> key).  However, it fills the logs with batch size warnings.  I was going to
> raise the batch warning size but the docs scared me away from doing that.
> Given that we're using unlogged/same partition batches is it safe to raise
> the batch size warning limit?   Actually cqlsh COPY FROM has very good
> throughput using a small batch size, but I can't get that same throughput
> in cassandra-stress or my C++ app with a batch size of 2.
>
> Thanks!
>
>
>
> -- Eric
>
> On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <jo...@jonhaddad.com>
> wrote:
>
>> How many CPUs are you using for interrupts?
>> http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux
>>
>> Have you tried making a flame graph to see where Cassandra is spending
>> its time?
>> http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>>
>> Are you tracking GC pauses?
>>
>> Jon
>>
>> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <er...@gmail.com> wrote:
>>
>>> Hi all:
>>>
>>> I'm new to Cassandra and I'm doing some performance testing.  One of
>>> things that I'm testing is ingestion throughput.   My server setup is:
>>>
>>>    - 3 node cluster
>>>    - SSD data (both commit log and sstables are on the same disk)
>>>    - 64 GB RAM per server
>>>    - 48 cores per server
>>>    - Cassandra 3.0.11
>>>    - 48 Gb heap using G1GC
>>>    - 1 Gbps NICs
>>>
>>> Since I'm using SSD I've tried tuning the following (one at a time) but
>>> none seemed to make a lot of difference:
>>>
>>>    - concurrent_writes=384
>>>    - memtable_flush_writers=8
>>>    - concurrent_compactors=8
>>>
>>> I am currently doing ingestion tests sending data from 3 clients on the
>>> same subnet.  I am using cassandra-stress to do some ingestion testing.
>>> The tests are using CL=ONE and RF=2.
>>>
>>> Using cassandra-stress (3.10) I am able to saturate the disk using a
>>> large enough column size and the standard five column cassandra-stress
>>> schema.  For example, -col size=fixed(400) will saturate the disk and
>>> compactions will start falling behind.
>>>
>>> One of our main tables has a row size that approximately 200 bytes,
>>> across 64 columns.  When ingesting this table I don't see any resource
>>> saturation.  Disk utilization is around 10-15% per iostat.  Incoming
>>> network traffic on the servers is around 100-300 Mbps.  CPU utilization is
>>> around 20-70%.  nodetool tpstats shows mostly zeros with occasional
>>> spikes around 500 in MutationStage.
>>>
>>> The stress run does 10,000,000 inserts per client, each with a separate
>>> range of partition IDs.  The run with 200 byte rows takes about 4 minutes,
>>> with mean Latency 4.5ms, Total GC time of 21 secs, Avg GC time 173 ms.
>>>
>>> The overall performance is good - around 120k rows/sec ingested.  But
>>> I'm curious to know where the bottleneck is.  There's no resource
>>> saturation and nodetool tpstats shows only occasional brief queueing.
>>> Is the rest just expected latency inside of Cassandra?
>>>
>>> Thanks,
>>>
>>> -- Eric
>>>
>>
>

Re: Bottleneck for small inserts?

Posted by Eric Pederson <er...@gmail.com>.
Hi Jonathan -

It looks like these machines are configured to use CPU 0 for all I/O
interrupts.  I don't think I'm going to get the OK to compile a new kernel
for them to balance the interrupts across CPUs, but to mitigate the problem
I taskset the Cassandra process to run on all CPUs except 0.  It didn't
change the performance though.  Let me know if you think it's crucial that
we balance the interrupts across CPUs and I can try to lobby for a new
kernel.

Here are flamegraphs from each node from a cassandra-stress ingest into a
table representative of what we are going to be using.   This table's rows
are also roughly 200 bytes, with 64 columns and the primary key (date,
sequence_number).  Cassandra-stress was run on 3 separate client machines.
Using cassandra-stress to write to this table I see the same thing: neither
disk, CPU nor network is fully utilized.

   - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva01_sars.svg
   - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva02_sars.svg
   - http://sourcedelica.com/wordpress/wp-content/uploads/2017/05/flamegraph_ultva03_sars.svg

Re: GC: In the stress run with the parameters above, two of the three nodes
log zero or one GCInspector messages.  On the other hand, the 3rd machine
logs a GCInspector every 5 seconds or so, 300-500ms each time.  I found out
that the 3rd machine actually has different specs than the other two.  It's
an older box with the same RAM but fewer CPUs (32 instead of 48), a slower
SSD and slower memory.   The Cassandra configuration is exactly the same.
I tried running Cassandra with only 32 CPUs on the newer boxes to see if
that would cause them to GC pause more, but it didn't.

On a separate topic - for this cassandra-stress run I reduced the batch
size to 2 in order to keep the logs clean.  That also reduced the
throughput from around 100k rows/second to 32k rows/sec.  I've been doing
ingestion tests using cassandra-stress, cqlsh COPY FROM and a custom C++
application.  In most of the tests that I've been doing I've been using a
batch size of around 20 (unlogged, all batch rows have the same partition
key).  However, it fills the logs with batch size warnings.  I was going to
raise the batch size warning threshold but the docs scared me away from doing
that.  Given that we're using unlogged, same-partition batches, is it safe to
raise the batch size warning limit?   Actually cqlsh COPY FROM has very good
throughput using a small batch size, but I can't get that same throughput
in cassandra-stress or my C++ app with a batch size of 2.
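
For concreteness, the batching described above looks roughly like this with
the Python driver (table and column names are placeholders; the real ingest
is the C++ app, but the shape is the same):

from collections import defaultdict
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

BATCH_SIZE = 20

cluster = Cluster(['ultva01', 'ultva02'])   # contact points (placeholders)
session = cluster.connect('test_ks')        # keyspace (placeholder)
insert = session.prepare(
    'INSERT INTO sars (date, sequence_number, col1, col2) VALUES (?, ?, ?, ?)')

def ingest(rows):
    # rows: iterable of (date, sequence_number, col1, col2) tuples
    by_partition = defaultdict(list)
    for row in rows:
        by_partition[row[0]].append(row)    # group by partition key (date)
    for partition_rows in by_partition.values():
        for i in range(0, len(partition_rows), BATCH_SIZE):
            batch = BatchStatement(batch_type=BatchType.UNLOGGED)
            for row in partition_rows[i:i + BATCH_SIZE]:
                batch.add(insert, row)
            session.execute(batch)

(The warning itself is governed by batch_size_warn_threshold_in_kb in
cassandra.yaml.)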

Thanks!



-- Eric

On Mon, May 22, 2017 at 5:08 PM, Jonathan Haddad <jo...@jonhaddad.com> wrote:

> How many CPUs are you using for interrupts?  http://www.alexonlinux.com/
> smp-affinity-and-proper-interrupt-handling-in-linux
>
> Have you tried making a flame graph to see where Cassandra is spending its
> time? http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>
> Are you tracking GC pauses?
>
> Jon
>
> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <er...@gmail.com> wrote:
>
>> Hi all:
>>
>> I'm new to Cassandra and I'm doing some performance testing.  One of
>> things that I'm testing is ingestion throughput.   My server setup is:
>>
>>    - 3 node cluster
>>    - SSD data (both commit log and sstables are on the same disk)
>>    - 64 GB RAM per server
>>    - 48 cores per server
>>    - Cassandra 3.0.11
>>    - 48 Gb heap using G1GC
>>    - 1 Gbps NICs
>>
>> Since I'm using SSD I've tried tuning the following (one at a time) but
>> none seemed to make a lot of difference:
>>
>>    - concurrent_writes=384
>>    - memtable_flush_writers=8
>>    - concurrent_compactors=8
>>
>> I am currently doing ingestion tests sending data from 3 clients on the
>> same subnet.  I am using cassandra-stress to do some ingestion testing.
>> The tests are using CL=ONE and RF=2.
>>
>> Using cassandra-stress (3.10) I am able to saturate the disk using a
>> large enough column size and the standard five column cassandra-stress
>> schema.  For example, -col size=fixed(400) will saturate the disk and
>> compactions will start falling behind.
>>
>> One of our main tables has a row size that approximately 200 bytes,
>> across 64 columns.  When ingesting this table I don't see any resource
>> saturation.  Disk utilization is around 10-15% per iostat.  Incoming
>> network traffic on the servers is around 100-300 Mbps.  CPU utilization is
>> around 20-70%.  nodetool tpstats shows mostly zeros with occasional
>> spikes around 500 in MutationStage.
>>
>> The stress run does 10,000,000 inserts per client, each with a separate
>> range of partition IDs.  The run with 200 byte rows takes about 4 minutes,
>> with mean Latency 4.5ms, Total GC time of 21 secs, Avg GC time 173 ms.
>>
>> The overall performance is good - around 120k rows/sec ingested.  But I'm
>> curious to know where the bottleneck is.  There's no resource saturation and
>> nodetool tpstats shows only occasional brief queueing.  Is the rest just
>> expected latency inside of Cassandra?
>>
>> Thanks,
>>
>> -- Eric
>>
>

Re: Bottleneck for small inserts?

Posted by Cogumelos Maravilha <co...@sapo.pt>.
Hi,

Change to durable_writes = false
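
For example (keyspace and contact point are placeholders; note that
durable_writes = false skips the commit log for that keyspace, so unflushed
memtable data is lost if a node crashes):

# Example only: skip the commit log for one keyspace to see how much of the
# ingest cost is commit log I/O.  Unflushed memtable data is lost if a node
# crashes, so treat this as a test knob rather than a production setting.
from cassandra.cluster import Cluster

session = Cluster(['ultva01']).connect()    # contact point (placeholder)
session.execute("ALTER KEYSPACE test_ks WITH durable_writes = false")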

And please post the results.

Thanks.

On 05/22/2017 10:08 PM, Jonathan Haddad wrote:
> How many CPUs are you using for interrupts?
>  http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux
>
> Have you tried making a flame graph to see where Cassandra is spending
> its
> time? http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
>
> Are you tracking GC pauses?
>
> Jon
>
> On Mon, May 22, 2017 at 2:03 PM Eric Pederson <ericacm@gmail.com> wrote:
>
>     Hi all:
>
>     I'm new to Cassandra and I'm doing some performance testing.  One
>     of things that I'm testing is ingestion throughput.   My server
>     setup is:
>
>       * 3 node cluster
>       * SSD data (both commit log and sstables are on the same disk)
>       * 64 GB RAM per server
>       * 48 cores per server
>       * Cassandra 3.0.11
>       * 48 Gb heap using G1GC
>       * 1 Gbps NICs
>
>     Since I'm using SSD I've tried tuning the following (one at a
>     time) but none seemed to make a lot of difference:
>
>       * concurrent_writes=384
>       * memtable_flush_writers=8
>       * concurrent_compactors=8
>
>     I am currently doing ingestion tests sending data from 3 clients
>     on the same subnet.  I am using cassandra-stress to do some
>     ingestion testing.  The tests are using CL=ONE and RF=2.
>
>     Using cassandra-stress (3.10) I am able to saturate the disk using
>     a large enough column size and the standard five column
>     cassandra-stress schema.  For example, -col size=fixed(400) will
>     saturate the disk and compactions will start falling behind. 
>
>     One of our main tables has a row size that approximately 200
>     bytes, across 64 columns.  When ingesting this table I don't see
>     any resource saturation.  Disk utilization is around 10-15% per
>     iostat.  Incoming network traffic on the servers is around 100-300
>     Mbps.  CPU utilization is around 20-70%.  nodetool tpstats shows
>     mostly zeros with occasional spikes around 500 in MutationStage.  
>
>     The stress run does 10,000,000 inserts per client, each with a
>     separate range of partition IDs.  The run with 200 byte rows takes
>     about 4 minutes, with mean Latency 4.5ms, Total GC time of 21
>     secs, Avg GC time 173 ms.   
>
>     The overall performance is good - around 120k rows/sec ingested. 
>     But I'm curious to know where the bottleneck is.  There's no
>     resource saturation and nodetool tpstats shows only occasional
>     brief queueing.  Is the rest just expected latency inside of
>     Cassandra?
>
>     Thanks,
>
>     -- Eric
>


Re: Bottleneck for small inserts?

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
How many CPUs are you using for interrupts?
http://www.alexonlinux.com/smp-affinity-and-proper-interrupt-handling-in-linux

Have you tried making a flame graph to see where Cassandra is spending its
time? http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html

Are you tracking GC pauses?

Jon


Re: Bottleneck for small inserts?

Posted by Patrick McFadin <pm...@gmail.com>.
When you are running a stress test, 1-1 match client to server won't
saturate a cluster. I would go closer to 3-5 clients per server, so 10-15
clients against your 3 node cluster.
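
If it's easier, those can just be additional cassandra-stress invocations, one per client host, each writing its own partition range so they don't overlap. Something along these lines, with host names, thread count and ranges purely as placeholders:

# on client host 1 (shift the -pop range on each additional client)
cassandra-stress write n=10000000 cl=ONE \
  -col "size=fixed(200)" \
  -pop seq=1..10000000 \
  -rate threads=256 \
  -node server1,server2,server3

It's also worth watching CPU and network on the client hosts while this runs, to rule out the senders themselves being the limit.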

Patrick

On Tue, May 23, 2017 at 4:18 PM, Jeff Jirsa <jj...@apache.org> wrote:

>
> Are the 3 sending clients maxed out?
> Are you seeing JVM GC pauses?

Re: Bottleneck for small inserts?

Posted by Jeff Jirsa <jj...@apache.org>.
Are the 3 sending clients maxed out?
Are you seeing JVM GC pauses?
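
For the GC side, a couple of quick things to look at on each node (log locations below assume a package install, so adjust to your layout):

# rough per-node GC summary
nodetool gcstats

# GCInspector logs any collection Cassandra considers long
grep GCInspector /var/log/cassandra/system.log | tail -20

# if full GC logging is enabled, the raw pause times are in the GC log
grep -i pause /var/log/cassandra/gc.log* | tail -20

With a 48 GB G1 heap it's worth confirming the 173 ms average isn't hiding occasional much longer pauses.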



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@cassandra.apache.org
For additional commands, e-mail: user-help@cassandra.apache.org