Posted to user@cassandra.apache.org by Jeremy Davis <je...@gmail.com> on 2010/08/10 01:00:11 UTC

Using a separate commit log drive was 4x slower

I have a weird one to share with the list: using a separate commit log drive
dropped my performance a lot more than I would expect...

I'm doing perf tests on 3 identical machines, each with a different drive
set (SAS 15K, SAS 10K, and SATA 7.5K).
Each system has a single system disk (same drive type as the data set) and
the data set itself (a 5-drive RAID-0).

I'm using Cassandra 0.6.4 with Java 1.6.0_20.
This is all RF=1, CL=1

I inserted an initial data set of 100K keys, each with 1000 columns of
random data (1000 bytes per column), then compacted and restarted Cassandra.

Then I ran a write baseline: 500 threads inserting 1000 random bytes at a
random key/column combination (always 1 column per request).
With the commit log on the RAID'd data drive I get about 19K column inserts
per second.

If I then add some random reads (30 threads each doing a random key/column
read, always 1 column per read), I get roughly 8K reads and 8K writes per second.
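
(For concreteness, the per-thread loop looks roughly like the sketch below. This
is not the actual test harness; it is a minimal sketch assuming the stock
Cassandra 0.6 Thrift Python bindings, the Keyspace1/PerfTest schema from the
config further down, and a hypothetical 'key<N>' row-key naming. Column names
are packed as 8-byte longs because the column family compares with LongType.)

import os, random, struct, time

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from cassandra import Cassandra
from cassandra.ttypes import ColumnPath, ConsistencyLevel

# ThriftFramedTransport is false in the config, so use a buffered transport.
socket = TSocket.TSocket('10.2.60.20', 9160)
transport = TTransport.TBufferedTransport(socket)
client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
transport.open()

KEYS, COLS = 100000, 1000

def column_path(col):
    # PerfTest uses CompareWith="LongType": column names are 8-byte big-endian longs.
    return ColumnPath(column_family='PerfTest', column=struct.pack('>q', col))

def write_one():
    key = 'key%d' % random.randrange(KEYS)      # hypothetical key naming
    value = os.urandom(1000)                    # 1000 bytes of random data
    client.insert('Keyspace1', key, column_path(random.randrange(COLS)),
                  value, int(time.time() * 1e6), ConsistencyLevel.ONE)

def read_one():
    key = 'key%d' % random.randrange(KEYS)
    client.get('Keyspace1', key, column_path(random.randrange(COLS)),
               ConsistencyLevel.ONE)

# Each of the 500 writer threads (and each of the 30 reader threads, when
# reads are enabled) just spins on a loop like this:
while True:
    write_one()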

 Host      Write Baseline (columns/sec)   Write (columns/sec)   Read (columns/sec)
 SAS15K    18800                          8700                  8100
 SAS10K    15800                          7600                  7300
 SATA      13200                          7300                  8000

Now, if I do the same thing but with the commit log on the system disk, I
get:

 Host      Write Baseline (columns/sec)   Write (columns/sec)   Read (columns/sec)
 SAS15K    12600                          2200                  1600
 SAS10K    11400                          3000                  1900
 SATA       9900                          3100                  1800


I would have thought the write rate would stay at about the baseline, and I
have no idea why the read rate would be so low.

Any thoughts?


Some iostat output (while running with the separate commit log drive):

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          38.33    0.00    4.72    2.48    0.00   54.47

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
cciss/c0d0        0.00   908.50    0.00  110.50     0.00  8152.00    73.77     0.57    5.20   4.93  54.50
cciss/c0d1        0.00     0.00   16.50    0.00  1424.00     0.00    86.30     0.10    6.06   2.73   4.50
dm-0              0.00     0.00    0.00 1019.00     0.00  8152.00     8.00     6.25    6.13   0.53  54.50
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          37.42    0.00    2.37    3.54    0.00   56.68

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
cciss/c0d0        0.00   854.50    0.00  124.50     0.00  7816.00    62.78     0.61    4.94   4.82  60.00
cciss/c0d1        0.00     0.00   32.00    0.00  4032.00     0.00   126.00     0.21    6.72   3.12  10.00
dm-0              0.00     0.00    0.00  979.50     0.00  7836.00     8.00     5.57    5.69   0.61  60.00
dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00


Some top output (while running with the separate commit log drive):

top - 15:56:38 up 6 days, 21:26,  9 users,  load average: 17.09, 7.92, 6.87
Tasks: 358 total,   1 running, 357 sleeping,   0 stopped,   0 zombie
Cpu(s): 35.4%us,  1.6%sy,  0.0%ni, 59.4%id,  3.2%wa,  0.0%hi,  0.5%si,  0.0%st
Mem:  24729068k total, 19789732k used,  4939336k free,   132056k buffers
Swap:  5849080k total,    54976k used,  5794104k free, 14839884k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
19411 root      20   0  142g 8.2g 5.0g S  599 34.9 423:25.42 java


----

<Storage>
  <ClusterName>Test Cluster</ClusterName>
  <AutoBootstrap>false</AutoBootstrap>
  <HintedHandoffEnabled>true</HintedHandoffEnabled>

  <Keyspaces>
    <Keyspace Name="Keyspace1">
      <ColumnFamily Name="PerfTest" CompareWith="LongType"/>
      <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
      <ReplicationFactor>1</ReplicationFactor>
      <EndPointSnitch>org.apache.cassandra.locator.EndPointSnitch</EndPointSnitch>
    </Keyspace>
  </Keyspaces>


  <Authenticator>org.apache.cassandra.auth.AllowAllAuthenticator</Authenticator>
  <Partitioner>org.apache.cassandra.dht.OrderPreservingPartitioner</Partitioner>
  <InitialToken></InitialToken>
  <CommitLogDirectory>/data/commitlog</CommitLogDirectory>
  <DataFileDirectories>
      <DataFileDirectory>/data/data</DataFileDirectory>
  </DataFileDirectories>
  <Seeds>
      <Seed>127.0.0.1</Seed>
  </Seeds>

  <RpcTimeoutInMillis>10000</RpcTimeoutInMillis>
  <CommitLogRotationThresholdInMB>1024</CommitLogRotationThresholdInMB>
  <ListenAddress>10.2.60.20</ListenAddress>
  <StoragePort>7000</StoragePort>
  <ThriftAddress>10.2.60.20</ThriftAddress>
  <ThriftPort>9160</ThriftPort>
  <ThriftFramedTransport>false</ThriftFramedTransport>
  <DiskAccessMode>auto</DiskAccessMode>
  <RowWarningThresholdInMB>512</RowWarningThresholdInMB>
  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>

  <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
  <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
  <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
  <MemtableThroughputInMB>512</MemtableThroughputInMB>
  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
  <MemtableOperationsInMillions>1.2</MemtableOperationsInMillions>
  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
  <ConcurrentReads>16</ConcurrentReads>
  <ConcurrentWrites>512</ConcurrentWrites>

  <CommitLogSync>batch</CommitLogSync>
  <CommitLogSyncBatchWindowInMS>5</CommitLogSyncBatchWindowInMS>
  <GCGraceSeconds>864000</GCGraceSeconds>
</Storage>
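
(For comparison only: 0.6 can also sync the commit log periodically instead of
per batch, acknowledging writes without waiting for the fsync. A fragment like
the one below, with an illustrative 10-second period, takes fsync latency out of
the write path at the cost of durability, which can help isolate whether the
sync cost is what is being measured.)

  <CommitLogSync>periodic</CommitLogSync>
  <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>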

Re: Using a separate commit log drive was 4x slower

Posted by Jonathan Ellis <jb...@gmail.com>.
Other activity, e.g. syslog?

Journaling at the FS level? You could try making a small partition
formatted as ext2.

On Tue, Aug 10, 2010 at 5:10 PM, Jeremy Davis
<je...@gmail.com> wrote:
> Yeah, it has a BBU, and it is charged and on..
> Very odd behavior, I'm stumped.
>
> -JD
>
> On Tue, Aug 10, 2010 at 12:28 AM, Peter Schuller
> <pe...@infidyne.com> wrote:
>>
>> I have no explanation for the slower reads, but I have an hypothesis
>> on the writes.
>>
>> Your iostat shows:
>>
>> > Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
>> > avgrq-sz avgqu-sz   await  svctm  %util
>> > cciss/c0d0        0.00   908.50    0.00  110.50     0.00  8152.00
>> > 73.77     0.57    5.20   4.93  54.50
>> > cciss/c0d1        0.00     0.00   16.50    0.00  1424.00     0.00
>> > 86.30     0.10    6.06   2.73   4.50
>> > dm-0              0.00     0.00    0.00 1019.00     0.00  8152.00
>> > 8.00     6.25    6.13   0.53  54.50
>> > dm-1              0.00     0.00    0.00    0.00     0.00     0.00
>> > 0.00     0.00    0.00   0.00   0.00
>>
>> So that's keeping the disk busy around 50-60% of the time. This seems
>> roughly consistent with your commit batch window being set to 5 ms and
>> the system drive NOT being supported by a battery-backed cache (such
>> that an fsync() actually does have to wait for the data to be on the
>> platter).
>>
>> Is your non-system drive backed by the BBU? (I'm not sure if the
>> controllers would support having some volumes backed by BBU protected
>> cache and not others.)
>>
>> If it is the case that the other volume is BBU backed, then maybe the
>> slowdown on writes is due to this. In any case, whatever the situation
>> was before, the above stats do seem roughly consistent with
>> write-through fsync() and batch window of 5ms, given sufficient
>> concurrency to achieve the throughput you're seeing. On the other
>> hand, "roughly consistent" is not very precise, and the original
>> performance on the RAID:ed device is probably also roughly consistent
>> with this ;)
>>
>> --
>> / Peter Schuller
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Using a separate commit log drive was 4x slower

Posted by Jeremy Davis <je...@gmail.com>.
Thanks for all the feedback; I'll be back in 2 weeks and will pick this up then.
-JD

On Tue, Aug 10, 2010 at 3:45 PM, Peter Schuller <peter.schuller@infidyne.com> wrote:

> > Yeah, it has a BBU, and it is charged and on..
> > Very odd behavior, I'm stumped.
>
> I advise double-checking raid volume settings and ensuring that policy
> is truly such that the write cache is used. This may also be a
> function of kernel driver settings depending on what RAID
> controller/kernel version you have (for example make sure that an
> fsync() doesn't result in asking the RAID controller to flush caches
> regardless of BBU state). In any case, these stats:
>
> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
> avgrq-sz avgqu-sz   await  svctm  %util
> cciss/c0d0        0.00   908.50    0.00  110.50     0.00  8152.00
> 73.77     0.57    5.20   4.93  54.50
>
> ... are highly inconsistent with write-back caching under the
> assumption that the writes are indeed the sequential writes to the
> commit log. Write counts in the ~ 100/second ballpark, with no reads,
> average request size of 74, an average transaction time of 4.93 and a
> utilization of 54% *really really* sounds like the I/O is going all
> the way down to the platter. Either that or the RAID firmware/driver
> is not doing its job properly.
>
> I've attached a small script (note: UGLY hack since I just whipped it
> up, not a proper tool, but it does the job for this) that you can run
> to test it:
>
>   ./writelat.py /path/to/file_to_write_to # warning, file will be
> truncated/destroyed if it exists
>
> If you run this on an idle system and you're truly doing write-back
> caching, you should see numbers BELOW 1 (i.e., sub-millisecond times)
> (but you can ignore the first sample probably).
>
> It writes one 8kb chunk of data, fsync:s, writes another 8kb, fsyncs,
> etc. I predict you'll see numbers in the 3-7 millisecond range.
>
> Sample output for me with a plain old SATA drive (and ZFS on FreeBSD)
> follows.
>
> 33.8141918182
> 8.32605361938
> 8.44812393188
> 8.44788551331
> 8.40210914612
> 8.4490776062
> 8.38303565979
> 8.57901573181
> 8.20922851562
> 8.21614265442
> 10.0581645966
> 8.37683677673
> 8.50605964661
> 8.376121521
> 9.86790657043
> 8.43715667725
> 8.45789909363
> 8.40520858765
> 8.4171295166
> 8.46195220947
> 8.41498374939
> 8.46099853516
> 8.44287872314
> 8.43000411987
> 8.455991745
>
> --
> / Peter Schuller
>

Re: Using a separate commit log drive was 4x slower

Posted by Peter Schuller <pe...@infidyne.com>.
> Yeah, it has a BBU, and it is charged and on..
> Very odd behavior, I'm stumped.

I advise double-checking raid volume settings and ensuring that policy
is truly such that the write cache is used. This may also be a
function of kernel driver settings depending on what RAID
controller/kernel version you have (for example make sure that an
fsync() doesn't result in asking the RAID controller to flush caches
regardless of BBU state). In any case, these stats:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
avgrq-sz avgqu-sz   await  svctm  %util
cciss/c0d0        0.00   908.50    0.00  110.50     0.00  8152.00
73.77     0.57    5.20   4.93  54.50

... are highly inconsistent with write-back caching under the
assumption that the writes are indeed the sequential writes to the
commit log. Write counts in the ~ 100/second ballpark, with no reads,
average request size of 74, an average transaction time of 4.93 and a
utilization of 54% *really really* sounds like the I/O is going all
the way down to the platter. Either that or the RAID firmware/driver
is not doing its job properly.

I've attached a small script (note: UGLY hack since I just whipped it
up, not a proper tool, but it does the job for this) that you can run
to test it:

   ./writelat.py /path/to/file_to_write_to # warning, file will be
truncated/destroyed if it exists

If you run this on an idle system and you're truly doing write-back
caching, you should see numbers BELOW 1 (i.e., sub-millisecond times)
(but you can ignore the first sample probably).

It writes one 8 KB chunk of data, fsyncs, writes another 8 KB, fsyncs, and
so on. I predict you'll see numbers in the 3-7 millisecond range.
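
(The attachment may not survive the plain-text archive; a minimal sketch of
such a script, doing exactly the write/fsync/time loop described above rather
than reproducing the original writelat.py, is:)

#!/usr/bin/env python
# Sketch of the fsync latency test described above (not the original writelat.py):
# write an 8 KB chunk, fsync, print the elapsed time in milliseconds, repeat.
import os, sys, time

path = sys.argv[1]   # WARNING: the file is truncated/destroyed if it exists
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0644)
chunk = '\0' * 8192

while True:
    start = time.time()
    os.write(fd, chunk)
    os.fsync(fd)
    print (time.time() - start) * 1000.0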

Sample output for me with a plain old SATA drive (and ZFS on FreeBSD) follows.

33.8141918182
8.32605361938
8.44812393188
8.44788551331
8.40210914612
8.4490776062
8.38303565979
8.57901573181
8.20922851562
8.21614265442
10.0581645966
8.37683677673
8.50605964661
8.376121521
9.86790657043
8.43715667725
8.45789909363
8.40520858765
8.4171295166
8.46195220947
8.41498374939
8.46099853516
8.44287872314
8.43000411987
8.455991745

-- 
/ Peter Schuller

Re: Using a separate commit log drive was 4x slower

Posted by Jeremy Davis <je...@gmail.com>.
Yeah, it has a BBU, and it is charged and on..
Very odd behavior, I'm stumped.

-JD

On Tue, Aug 10, 2010 at 12:28 AM, Peter Schuller <peter.schuller@infidyne.com> wrote:

> I have no explanation for the slower reads, but I have an hypothesis
> on the writes.
>
> Your iostat shows:
>
> > Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s
> avgrq-sz avgqu-sz   await  svctm  %util
> > cciss/c0d0        0.00   908.50    0.00  110.50     0.00  8152.00
> 73.77     0.57    5.20   4.93  54.50
> > cciss/c0d1        0.00     0.00   16.50    0.00  1424.00     0.00
> 86.30     0.10    6.06   2.73   4.50
> > dm-0              0.00     0.00    0.00 1019.00     0.00  8152.00
> 8.00     6.25    6.13   0.53  54.50
> > dm-1              0.00     0.00    0.00    0.00     0.00     0.00
> 0.00     0.00    0.00   0.00   0.00
>
> So that's keeping the disk busy around 50-60% of the time. This seems
> roughly consistent with your commit batch window being set to 5 ms and
> the system drive NOT being supported by a battery-backed cache (such
> that an fsync() actually does have to wait for the data to be on the
> platter).
>
> Is your non-system drive backed by the BBU? (I'm not sure if the
> controllers would support having some volumes backed by BBU protected
> cache and not others.)
>
> If it is the case that the other volume is BBU backed, then maybe the
> slowdown on writes is due to this. In any case, whatever the situation
> was before, the above stats do seem roughly consistent with
> write-through fsync() and batch window of 5ms, given sufficient
> concurrency to achieve the throughput you're seeing. On the other
> hand, "roughly consistent" is not very precise, and the original
> performance on the RAID:ed device is probably also roughly consistent
> with this ;)
>
> --
> / Peter Schuller
>

Re: Using a separate commit log drive was 4x slower

Posted by Peter Schuller <pe...@infidyne.com>.
I have no explanation for the slower reads, but I have an hypothesis
on the writes.

Your iostat shows:

> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> cciss/c0d0        0.00   908.50    0.00  110.50     0.00  8152.00    73.77     0.57    5.20   4.93  54.50
> cciss/c0d1        0.00     0.00   16.50    0.00  1424.00     0.00    86.30     0.10    6.06   2.73   4.50
> dm-0              0.00     0.00    0.00 1019.00     0.00  8152.00     8.00     6.25    6.13   0.53  54.50
> dm-1              0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00

So that's keeping the disk busy around 50-60% of the time. This seems
roughly consistent with your commit batch window being set to 5 ms and
the system drive NOT being supported by a battery-backed cache (such
that an fsync() actually does have to wait for the data to be on the
platter).
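
(Rough arithmetic behind "roughly consistent", using an assumed ~5 ms
write-through fsync rather than anything measured here:)

# Back-of-envelope sketch: batch window + write-through fsync bounds the
# number of commit-log sync cycles per second, and every acked write has
# to ride one of those cycles.
batch_window_ms = 5.0          # CommitLogSyncBatchWindowInMS from the config
assumed_fsync_ms = 5.0         # assumption, not measured in this thread
sync_cycles_per_sec = 1000.0 / (batch_window_ms + assumed_fsync_ms)  # ~100/s vs ~110 w/s in iostat
observed_inserts_per_sec = 2200                # SAS15K, mixed read/write run
print observed_inserts_per_sec / sync_cycles_per_sec  # ~22 mutations grouped per sync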

Is your non-system drive backed by the BBU? (I'm not sure if the
controllers would support having some volumes backed by BBU protected
cache and not others.)

If it is the case that the other volume is BBU backed, then maybe the
slowdown on writes is due to this. In any case, whatever the situation
was before, the above stats do seem roughly consistent with
write-through fsync() and batch window of 5ms, given sufficient
concurrency to achieve the throughput you're seeing. On the other
hand, "roughly consistent" is not very precise, and the original
performance on the RAID:ed device is probably also roughly consistent
with this ;)

--
/ Peter Schuller