Posted to user@cassandra.apache.org by Ilya Maykov <iv...@gmail.com> on 2010/04/06 06:31:44 UTC

Overwhelming a cluster with writes?

Hi all,

I've just started experimenting with Cassandra to get a feel for the
system. I've set up a test cluster and to get a ballpark idea of its
performance I wrote a simple tool to load some toy data into the
system. Surprisingly, I am able to "overwhelm" my 4-node cluster with
writes from a single client. I'm trying to figure out if this is a
problem with my setup, if I'm hitting bugs in the Cassandra codebase,
or if this is intended behavior. Sorry this email is kind of long,
here is the TLDR version:

While writing to Cassandra from a single client machine, I am able to get the
cluster into a bad state, where nodes are randomly disconnecting from
each other, write performance plummets, and sometimes nodes even
crash. Further, the nodes do not recover as long as the writes
continue (even at a much lower rate), and sometimes do not recover at
all unless I restart them. I can get this to happen simply by throwing
data at the cluster fast enough, and I'm wondering if this is a known
issue or if I need to tweak my setup.

Now, the details.

First, a little bit about the setup:

4-node cluster of identical machines, running cassandra-0.6.0-rc1 with
the fixes for CASSANDRA-933, CASSANDRA-934, and CASSANDRA-936 patched
in. Node specs:
8-core Intel Xeon E5405@2.00GHz
8GB RAM
1Gbit ethernet
Red Hat Linux 2.6.18
JVM 1.6.0_19 64-bit
1TB spinning disk houses both commitlog and data directories (which I
know is not ideal).
The client machine is on the same local network and has very similar specs.

The cassandra nodes are started with the following JVM options:

./cassandra JVM_OPTS="-Xms6144m -Xmx6144m -XX:+UseConcMarkSweepGC -d64
-XX:NewSize=1024m -XX:MaxNewSize=1024m -XX:+DisableExplicitGC"

I'm using default settings for all of the tunable stuff at the bottom
of storage-conf.xml. I also selected my initial tokens to evenly
partition the key space when the cluster was bootstrapped. I am using
the RandomPartitioner.

Now, about the test. Basically I am trying to get an idea of just how
fast I can make this thing go. I am writing ~250M data records into
the cluster, replicated at 3x, using Ran Tavory's Hector client
(Java), writing with ConsistencyLevel.ZERO and
FailoverPolicy.FAIL_FAST. The client is using 32 threads with 8
threads talking to each of the 4 nodes in the cluster. Records are
identified by a numeric id, and I'm writing them in batches of up to
10k records per row, with each record in its own column. The row key
identifies the bucket into which records fall. So, records with ids 0
- 9999 are written to row "0", 10000 - 19999 are written to row
"10000", etc. Each record is a JSON object with ~10-20 fields.

Records: {  // Column Family
  0 : {  // row key for the start of the bucket; buckets span up to 10000 records
    1 : "{ /* some JSON */ }",       // Column for record with id=1
    3 : "{ /* some more JSON */ }",  // Column for record with id=3
    ...
    9999 : "{ /* ... */ }"
  },
  10000 : {  // row key for the start of the next bucket
    10001 : "{ /* ... */ }",
    10004 : "{ /* ... */ }",
    ...
  }
}
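
For illustration, the bucketing boils down to something like the following
minimal sketch (plain Java; the class and method names are made up for this
example and are not part of the loader described here):

public class Buckets {
    static final long BUCKET_SIZE = 10000L;

    // Records with ids 0-9999 go to row "0", 10000-19999 to row "10000", etc.
    static String rowKeyFor(long recordId) {
        return String.valueOf((recordId / BUCKET_SIZE) * BUCKET_SIZE);  // e.g. 12345 -> "10000"
    }

    // Each record becomes its own column, named by its id.
    static String columnNameFor(long recordId) {
        return String.valueOf(recordId);
    }
}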

I am reading the data out of a local, sorted file on the client, so I
only write a row to Cassandra once all records for that row have been
read, and each row is written to exactly once. I'm using a
producer-consumer queue to pump data from the input reader thread to
the output writer threads. I found that I have to throttle the reader
thread heavily in order to get good behavior. So, if I make the reader
sleep for 7 seconds every 1M records, everything is fine - the data
loads in about an hour, half of which is spent by the reader thread
sleeping. In between the sleeps, I see ~40-50 MB/s throughput on the
client's network interface while the reader is not sleeping, and it
takes ~7-8 seconds to write each batch of 1M records.
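
To make that loader shape concrete, here is a minimal sketch under the
assumptions above (illustrative only; the file parsing and the actual
Cassandra batch insert are stubbed out, and names like writeRowToCassandra
are made up):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ThrottledLoader {
    static final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(100);
    static final int WRITER_THREADS = 32;  // 8 per node x 4 nodes

    public static void main(String[] args) throws Exception {
        // Writer threads: drain the queue and push each bucket row to the cluster.
        for (int i = 0; i < WRITER_THREADS; i++) {
            new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            writeRowToCassandra(queue.take());  // stub for the real batch insert
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            }).start();
        }

        // Reader thread: parse bucket rows from the sorted file, throttling periodically.
        long records = 0;
        for (String row : readRowsFromSortedFile()) {  // stub for the real input reader
            queue.put(row);                            // blocks if the writers fall behind
            records += 10000;                          // up to 10k records per bucket row
            if (records % 1000000 == 0) {
                Thread.sleep(7000);                    // the ~7 s pause per 1M records
            }
        }
    }

    static Iterable<String> readRowsFromSortedFile() { return java.util.Collections.<String>emptyList(); }
    static void writeRowToCassandra(String row) { /* batch insert via the client library */ }
}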

Now, if I remove the 7-second sleeps on the client side, things get
bad after the first ~8M records are written to the cluster. Write
throughput drops to <5 MB/s. I start seeing messages about nodes
disconnecting and reconnecting in Cassandra's system.log, as well as
lots of GC messages:

...
 INFO [Timer-1] 2010-04-06 04:03:27,178 Gossiper.java (line 179)
InetAddress /10.15.38.88 is now dead.
 INFO [GC inspection] 2010-04-06 04:03:30,259 GCInspector.java (line
110) GC for ConcurrentMarkSweep: 2989 ms, 55326320 reclaimed leaving
1035998648 used; max is 1211170816
 INFO [GC inspection] 2010-04-06 04:03:41,838 GCInspector.java (line
110) GC for ConcurrentMarkSweep: 3004 ms, 24377240 reclaimed leaving
1066120952 used; max is 1211170816
 INFO [Timer-1] 2010-04-06 04:03:44,136 Gossiper.java (line 179)
InetAddress /10.15.38.55 is now dead.
 INFO [GMFD:1] 2010-04-06 04:03:44,138 Gossiper.java (line 568)
InetAddress /10.15.38.55 is now UP
 INFO [GC inspection] 2010-04-06 04:03:52,957 GCInspector.java (line
110) GC for ConcurrentMarkSweep: 2319 ms, 4504888 reclaimed leaving
1086023832 used; max is 1211170816
 INFO [Timer-1] 2010-04-06 04:04:19,508 Gossiper.java (line 179)
InetAddress /10.15.38.242 is now dead.
 INFO [Timer-1] 2010-04-06 04:05:03,039 Gossiper.java (line 179)
InetAddress /10.15.38.55 is now dead.
 INFO [GMFD:1] 2010-04-06 04:05:03,041 Gossiper.java (line 568)
InetAddress /10.15.38.55 is now UP
 INFO [GC inspection] 2010-04-06 04:05:08,539 GCInspector.java (line
110) GC for ConcurrentMarkSweep: 2375 ms, 39534920 reclaimed leaving
1051620856 used; max is 1211170816
...

This is finally followed by an OutOfMemoryError, with some or all nodes going down:

ERROR [COMPACTION-POOL:1] 2010-04-06 04:05:05,475
DebuggableThreadPoolExecutor.java (line 94) Error in executor
futuretask
java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError:
Java heap space
	at java.util.concurrent.FutureTask$Sync.innerGet(Unknown Source)
	at java.util.concurrent.FutureTask.get(Unknown Source)
	at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)
	at org.apache.cassandra.db.CompactionManager$CompactionExecutor.afterExecute(CompactionManager.java:582)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Unknown Source)
	at java.io.ByteArrayOutputStream.write(Unknown Source)
	at java.io.DataOutputStream.write(Unknown Source)
	at org.apache.cassandra.io.IteratingRow.echoData(IteratingRow.java:69)
	at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:138)
	at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:1)
	at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
	at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
	at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
	at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
	at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
	at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:299)
	at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:102)
	at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:1)
	at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
	at java.util.concurrent.FutureTask.run(Unknown Source)
	... 3 more

At first I thought that with ConsistencyLevel.ZERO I must be doing
async writes, so Cassandra can't push back on the client threads by
blocking them and the servers get overwhelmed. But, I would
expect it to start dropping data and not crash in that case (after
all, I did say ZERO so I can't expect any reliability, right?).
However, I see similar slowdown / node dropout behavior when I set the
consistency level to ONE. Does Cassandra push back on writers under
heavy load? Is there some magic setting I need to tune to have it not
fall over? Do I just need a bigger cluster? Thanks in advance,

-- Ilya

P.S. I realize that it's still handling a LOT of data with just 4
nodes, and in practice nobody would run a system that gets 125k writes
per second on top of a 4 node cluster. I was just surprised that I
could make Cassandra fall over at all using a single client that's
pumping data at 40-50 MB/s.
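
(As a sanity check on that figure: ~1M records per ~7-8 seconds of active
writing is roughly 125-140k records/second, and with 3x replication that is
roughly 375-420k replica writes/second landing on 4 nodes.)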

Re: Overwhelming a cluster with writes?

Posted by Ilya Maykov <iv...@gmail.com>.
I just tried the same test with ConsistencyLevel.ALL, and the problem
went away - the writes are somewhat slower but the cluster never gets
into a bad state. So, I wonder if this is a bug in Cassandra's
handling of async / "non-ConsistencyLevel.ALL" writes ...

-- Ilya

Re: Overwhelming a cluster with writes?

Posted by Tatu Saloranta <ts...@gmail.com>.
On Tue, Apr 6, 2010 at 8:17 AM, Jonathan Ellis <jb...@gmail.com> wrote:
> On Tue, Apr 6, 2010 at 2:13 AM, Ilya Maykov <iv...@gmail.com> wrote:
>> That does sound similar. It's possible that the difference I'm seeing
>> between ConsistencyLevel.ZERO and ConsistencyLevel.ALL is simply due
>> to the fact that using ALL slows down the writers enough that the GC
>> can keep up.
>
> No, it's mostly due to ZERO meaning "buffer this locally and write it
> when it's convenient," and buffering takes memory.  If you check your
> tpstats you will see the pending ops through the roof on the node
> handling the thrift connections.
>

This sounds like a great FAQ entry (apologies if it's already included),
so that ideally users would only use this setting if they (think they)
know what they are doing. :-)

-+ Tatu +-

Re: Overwhelming a cluster with writes?

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Apr 6, 2010 at 2:13 AM, Ilya Maykov <iv...@gmail.com> wrote:
> That does sound similar. It's possible that the difference I'm seeing
> between ConsistencyLevel.ZERO and ConsistencyLevel.ALL is simply due
> to the fact that using ALL slows down the writers enough that the GC
> can keep up.

No, it's mostly due to ZERO meaning "buffer this locally and write it
when it's convenient," and buffering takes memory.  If you check your
tpstats you will see the pending ops through the roof on the node
handling the thrift connections.
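
A rough back-of-envelope illustrates the point, assuming the backlog on the
node handling the Thrift connections grows at roughly the observed ~45 MB/s
client rate and that about 80% of the 6 GB heap is actually usable (both
numbers are just assumptions drawn from the figures in this thread):

public class BufferMath {
    public static void main(String[] args) {
        double incomingMBperSec = 45.0;        // observed client network throughput
        double usableHeapMB = 6 * 1024 * 0.8;  // assume ~80% of the 6 GB heap is usable
        // If nothing drains, the locally buffered writes exhaust the heap in under two minutes.
        System.out.printf("heap fills in ~%.0f seconds%n", usableHeapMB / incomingMBperSec);
    }
}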

Re: Overwhelming a cluster with writes?

Posted by Benjamin Black <b...@b3k.us>.
OK, cool, looking forward to your results.

On Tue, Apr 6, 2010 at 12:18 AM, Ilya Maykov <iv...@gmail.com> wrote:
> Right, I meant 4GB heap vs. the standard 1GB. And all other options in
> cassandra.in.sh at their defaults.
>
> Sorry I am a bit new to JVM tuning, and very new to Cassandra :)
>
> -- Ilya

Re: Overwhelming a cluster with writes?

Posted by Jake Luciani <ja...@gmail.com>.
Hi Ilya,

You will always blow up if you use consistency level ZERO to write
gigs of data. The safe minimum for writes is ONE. ZERO is meant for
small, non-batched writes.

Also look into the batch_mutate call to write lots of data at once, in a
series of chunks. This helps save on network back-and-forth.
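
As a minimal, illustrative sketch of that chunking idea (sendBatch() is a
hypothetical stand-in for whatever batch write call the client library
exposes, e.g. a Hector batch insert or Thrift batch_mutate):

import java.util.LinkedHashMap;
import java.util.Map;

public class ChunkedWriter {
    // Split one large bucket row into smaller batches so each request stays a manageable size.
    static void writeBucketInChunks(String rowKey, Map<String, String> columns, int chunkSize) {
        Map<String, String> chunk = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> e : columns.entrySet()) {
            chunk.put(e.getKey(), e.getValue());
            if (chunk.size() == chunkSize) {
                sendBatch(rowKey, chunk);              // one round trip per chunk
                chunk = new LinkedHashMap<String, String>();
            }
        }
        if (!chunk.isEmpty()) {
            sendBatch(rowKey, chunk);                  // trailing partial chunk
        }
    }

    static void sendBatch(String rowKey, Map<String, String> chunk) {
        // stub: issue one batch mutation for (rowKey, chunk) via the client library
    }
}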

-Jake



On Apr 6, 2010, at 3:18 AM, Ilya Maykov <iv...@gmail.com> wrote:

> Right, I meant 4GB heap vs. the standard 1GB. And all other options in
> cassandra.in.sh at their defaults.
>
> Sorry I am a bit new to JVM tuning, and very new to Cassandra :)
>
> -- Ilya

Re: Overwhelming a cluster with writes?

Posted by Ilya Maykov <iv...@gmail.com>.
Right, I meant 4GB heap vs. the standard 1GB. And all other options in
cassandra.in.sh at their defaults.

Sorry I am a bit new to JVM tuning, and very new to Cassandra :)

-- Ilya

On Tue, Apr 6, 2010 at 12:16 AM, Benjamin Black <b...@b3k.us> wrote:
> I am specifically suggesting you NOT use a heap that large with your
> 8GB machines.  Please test with 4GB first.

Re: Overwhelming a cluster with writes?

Posted by Benjamin Black <b...@b3k.us>.
I am specifically suggesting you NOT use a heap that large with your
8GB machines.  Please test with 4GB first.

On Tue, Apr 6, 2010 at 12:13 AM, Ilya Maykov <iv...@gmail.com> wrote:
> That does sound similar. It's possible that the difference I'm seeing
> between ConsistencyLevel.ZERO and ConsistencyLevel.ALL is simply due
> to the fact that using ALL slows down the writers enough that the GC
> can keep up. I could do a test with multiple clients writing at ALL in
> parallel tomorrow. If there are still no problems writing at ALL even
> with extra load from additional clients, that might point to problems
> in how async writes are handled vs. sync writes.
>
> I will also do some profiling of the server processes with both ZERO
> and ALL writer behaviors and report back.
>
> RE: JVM_OPTS, I will try running with the "more sane" options (but a
> larger heap) as well.
>
> -- Ilya

Re: Overwhelming a cluster with writes?

Posted by Ilya Maykov <iv...@gmail.com>.
That does sound similar. It's possible that the difference I'm seeing
between ConsistencyLevel.ZERO and ConsistencyLevel.ALL is simply due
to the fact that using ALL slows down the writers enough that the GC
can keep up. I could do a test with multiple clients writing at ALL in
parallel tomorrow. If there are still no problems writing at ALL even
with extra load from additional clients, that might point to problems
in how async writes are handled vs. sync writes.

I will also do some profiling of the server processes with both ZERO
and ALL writer behaviors and report back.

RE: JVM_OPTS, I will try running with the "more sane" options (but a
larger heap) as well.

-- Ilya

On Mon, Apr 5, 2010 at 11:59 PM, Rob Coli <rc...@digg.com> wrote:
> On 4/5/10 11:48 PM, Ilya Maykov wrote:
>>
>> No, the disks on all nodes have about 750GB free space. Also as
>> mentioned in my follow-up email, writing with ConsistencyLevel.ALL
>> makes the slowdowns / crashes go away.
>
> I am not sure if the above is consistent with the cause of #896, but the
> other symptoms ("I inserted a bunch of data really fast via Thrift and GC
> melted my machine!") sound like it..
>
> https://issues.apache.org/jira/browse/CASSANDRA-896
>
> =Rob
>

Re: Overwhelming a cluster with writes?

Posted by Rob Coli <rc...@digg.com>.
On 4/5/10 11:48 PM, Ilya Maykov wrote:
> No, the disks on all nodes have about 750GB free space. Also as
> mentioned in my follow-up email, writing with ConsistencyLevel.ALL
> makes the slowdowns / crashes go away.

I am not sure if the above is consistent with the cause of #896, but the 
other symptoms ("I inserted a bunch of data really fast via Thrift and 
GC melted my machine!") sound like it..

https://issues.apache.org/jira/browse/CASSANDRA-896

=Rob

Re: Overwhelming a cluster with writes?

Posted by Benjamin Black <b...@b3k.us>.
You are blowing away the mostly saner default JVM_OPTS by running it that way.
Edit cassandra.in.sh (or wherever the config is on your system) to
increase -Xmx to 4G (not 6G, for now), leave everything else
untouched, and do not specify JVM_OPTS on the command line. See if you
get the same behavior.


b

On Mon, Apr 5, 2010 at 11:48 PM, Ilya Maykov <iv...@gmail.com> wrote:
> No, the disks on all nodes have about 750GB free space. Also as
> mentioned in my follow-up email, writing with ConsistencyLevel.ALL
> makes the slowdowns / crashes go away.
>
> -- Ilya

Re: Overwhelming a cluster with writes?

Posted by Ilya Maykov <iv...@gmail.com>.
No, the disks on all nodes have about 750GB free space. Also as
mentioned in my follow-up email, writing with ConsistencyLevel.ALL
makes the slowdowns / crashes go away.

-- Ilya

On Mon, Apr 5, 2010 at 11:46 PM, Ran Tavory <ra...@gmail.com> wrote:
> Do you see one of the disks used by cassandra filled up when a node crashes?
>

Re: Overwhelming a cluster with writes?

Posted by Ran Tavory <ra...@gmail.com>.
Do you see one of the disks used by cassandra filled up when a node crashes?

On Tue, Apr 6, 2010 at 9:39 AM, Ilya Maykov <iv...@gmail.com> wrote:

> I'm running the nodes with a JVM heap size of 6GB, and here are the
> related options from my storage-conf.xml. As mentioned in the first
> email, I left everything at the default value. I briefly googled
> around for "Cassandra performance tuning" etc but haven't found a
> definitive guide ... any help with tuning these parameters is greatly
> appreciated!
>
>  <DiskAccessMode>auto</DiskAccessMode>
>  <RowWarningThresholdInMB>512</RowWarningThresholdInMB>
>  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
>  <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
>  <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
>  <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
>  <MemtableThroughputInMB>64</MemtableThroughputInMB>
>  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
>  <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
>  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
>  <ConcurrentReads>8</ConcurrentReads>
>  <ConcurrentWrites>64</ConcurrentWrites>
>  <CommitLogSync>periodic</CommitLogSync>
>  <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>
>  <GCGraceSeconds>864000</GCGraceSeconds>
>
> -- Ilya
>

Re: Overwhelming a cluster with writes?

Posted by Ilya Maykov <iv...@gmail.com>.
I'm running the nodes with a JVM heap size of 6GB, and here are the
related options from my storage-conf.xml. As mentioned in the first
email, I left everything at its default value. I briefly googled
around for "Cassandra performance tuning" and the like, but haven't
found a definitive guide ... any help with tuning these parameters is
greatly appreciated!

  <DiskAccessMode>auto</DiskAccessMode>
  <RowWarningThresholdInMB>512</RowWarningThresholdInMB>
  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
  <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
  <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
  <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
  <MemtableThroughputInMB>64</MemtableThroughputInMB>
  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
  <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
  <ConcurrentReads>8</ConcurrentReads>
  <ConcurrentWrites>64</ConcurrentWrites>
  <CommitLogSync>periodic</CommitLogSync>
  <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>
  <GCGraceSeconds>864000</GCGraceSeconds>
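
As a rough sanity check, these memtable settings can be related to the
6GB heap with some back-of-the-envelope arithmetic. The overhead
multiplier and the number of memtables in flight below are assumptions,
not measured values:

    // Rough estimate: with the settings above, a memtable flushes at ~64 MB
    // of serialized data (or ~300k operations), but its in-memory object
    // footprint is several times larger. All multipliers are assumptions.
    public class MemtableHeapEstimate {
        public static void main(String[] args) {
            long memtableThroughputMB = 64;   // MemtableThroughputInMB above
            double inMemoryOverhead = 8.0;    // assumed JVM object overhead vs. serialized size
            int columnFamilies = 1;           // assumed: one application CF in this test
            int memtablesInFlight = 2;        // assumed: one active memtable + one being flushed

            double roughHeapMB = memtableThroughputMB * inMemoryOverhead
                    * columnFamilies * memtablesInFlight;
            System.out.printf("~%.0f MB of heap for memtables alone%n", roughHeapMB);
        }
    }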

-- Ilya

On Mon, Apr 5, 2010 at 11:26 PM, Boris Shulman <sh...@gmail.com> wrote:
> You are running out of memory on your nodes. Before the final crash
> your nodes are probably slow  due to GC. What is your memtable size?
> What cache options did you configure?
>

Re: Overwhelming a cluster with writes?

Posted by Boris Shulman <sh...@gmail.com>.
You are running out of memory on your nodes. Before the final crash,
your nodes are probably slow due to GC. What is your memtable size?
What cache options did you configure?
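
One quick way to confirm that while the load is running is to watch
heap usage over JMX. A minimal sketch (nothing Cassandra-specific;
point it at whatever JMX host:port your nodes expose):

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;
    import javax.management.MBeanServerConnection;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class HeapWatcher {
        public static void main(String[] args) throws Exception {
            // args[0] is the node's JMX endpoint, e.g. "10.15.38.88:<jmx-port>"
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + args[0] + "/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url);
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            MemoryMXBean memory = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
            while (true) {
                MemoryUsage heap = memory.getHeapMemoryUsage();
                System.out.printf("heap used=%d MB, max=%d MB%n",
                        heap.getUsed() >> 20, heap.getMax() >> 20);
                Thread.sleep(5000);
            }
        }
    }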

On Tue, Apr 6, 2010 at 7:31 AM, Ilya Maykov <iv...@gmail.com> wrote:
> Hi all,
>
> I've just started experimenting with Cassandra to get a feel for the
> system. I've set up a test cluster and to get a ballpark idea of its
> performance I wrote a simple tool to load some toy data into the
> system. Surprisingly, I am able to "overwhelm" my 4-node cluster with
> writes from a single client. I'm trying to figure out if this is a
> problem with my setup, if I'm hitting bugs in the Cassandra codebase,
> or if this is intended behavior. Sorry this email is kind of long,
> here is the TLDR version:
>
> While writing to Cassandra from a single node, I am able to get the
> cluster into a bad state, where nodes are randomly disconnecting from
> each other, write performance plummets, and sometimes nodes even
> crash. Further, the nodes do not recover as long as the writes
> continue (even at a much lower rate), and sometimes do not recover at
> all unless I restart them. I can get this to happen simply by throwing
> data at the cluster fast enough, and I'm wondering if this is a known
> issue or if I need to tweak my setup.
>
> Now, the details.
>
> First, a little bit about the setup:
>
> 4-node cluster of identical machines, running cassandra-0.6.0-rc1 with
> the fixes for CASSANDRA-933, CASSANDRA-934, and CASSANDRA-936 patched
> in. Node specs:
> 8-core Intel Xeon E5405@2.00GHz
> 8GB RAM
> 1Gbit ethernet
> Red Hat Linux 2.6.18
> JVM 1.6.0_19 64-bit
> 1TB spinning disk houses both commitlog and data directories (which I
> know is not ideal).
> The client machine is on the same local network and has very similar specs.
>
> The cassandra nodes are started with the following JVM options:
>
> ./cassandra JVM_OPTS="-Xms6144m -Xmx6144m -XX:+UseConcMarkSweepGC -d64
> -XX:NewSize=1024m -XX:MaxNewSize=1024m -XX:+DisableExplicitGC"
>
> I'm using default settings for all of the tunable stuff at the bottom
> of storage-conf.xml. I also selected my initial tokens to evenly
> partition the key space when the cluster was bootstrapped. I am using
> the RandomPartitioner.
>
> Now, about the test. Basically I am trying to get an idea of just how
> fast I can make this thing go. I am writing ~250M data records into
> the cluster, replicated at 3x, using Ran Tavory's Hector client
> (Java), writing with ConsistencyLevel.ZERO and
> FailoverPolicy.FAIL_FAST. The client is using 32 threads with 8
> threads talking to each of the 4 nodes in the cluster. Records are
> identified by a numeric id, and I'm writing them in batches of up to
> 10k records per row, with each record in its own column. The row key
> identifies the bucket into which records fall. So, records with ids 0
> - 9999 are written to row "0", 10000 - 19999 are written to row
> "10000", etc. Each record is a JSON object with ~10-20 fields.
>
> Records: {  // Column Family
>   0 : {  // row key for the start of the bucket. Buckets span a range
> of up to 10000 records
>     1 : "{ /* some JSON */ }",  // Column for record with id=1
>     3 : "{ /* some more JSON */ }",  // Column for record with id=3
>    ...
>    9999 : "{ /* ... */ }"
>   },
>  10000 : {  // row key for the start of the next bucket
>    10001 : ...
>    10004 :
> }
>
> I am reading the data out of a local, sorted file on the client, so I
> only write a row to Cassandra once all records for that row have been
> read, and each row is written to exactly once. I'm using a
> producer-consumer queue to pump data from the input reader thread to
> the output writer threads. I found that I have to throttle the reader
> thread heavily in order to get good behavior. So, if I make the reader
> sleep for 7 seconds every 1M records, everything is fine - the data
> loads in about an hour, half of which is spent by the reader thread
> sleeping. In between the sleeps, I see ~40-50 MB/s throughput on the
> client's network interface, and it takes ~7-8 seconds to write each
> batch of 1M records.
>
> Now, if I remove the 7-second sleeps on the client side, things get
> bad after the first ~8M records are written to the cluster. Write
> throughput drops to <5 MB/s. I start seeing messages about nodes
> disconnecting and reconnecting in Cassandra's system.log, as well as
> lots of GC messages:
>
> ...
>  INFO [Timer-1] 2010-04-06 04:03:27,178 Gossiper.java (line 179)
> InetAddress /10.15.38.88 is now dead.
>  INFO [GC inspection] 2010-04-06 04:03:30,259 GCInspector.java (line
> 110) GC for ConcurrentMarkSweep: 2989 ms, 55326320 reclaimed leaving
> 1035998648 used; max is 1211170816
>  INFO [GC inspection] 2010-04-06 04:03:41,838 GCInspector.java (line
> 110) GC for ConcurrentMarkSweep: 3004 ms, 24377240 reclaimed leaving
> 1066120952 used; max is 1211170816
>  INFO [Timer-1] 2010-04-06 04:03:44,136 Gossiper.java (line 179)
> InetAddress /10.15.38.55 is now dead.
>  INFO [GMFD:1] 2010-04-06 04:03:44,138 Gossiper.java (line 568)
> InetAddress /10.15.38.55 is now UP
>  INFO [GC inspection] 2010-04-06 04:03:52,957 GCInspector.java (line
> 110) GC for ConcurrentMarkSweep: 2319 ms, 4504888 reclaimed leaving
> 1086023832 used; max is 1211170816
>  INFO [Timer-1] 2010-04-06 04:04:19,508 Gossiper.java (line 179)
> InetAddress /10.15.38.242 is now dead.
>  INFO [Timer-1] 2010-04-06 04:05:03,039 Gossiper.java (line 179)
> InetAddress /10.15.38.55 is now dead.
>  INFO [GMFD:1] 2010-04-06 04:05:03,041 Gossiper.java (line 568)
> InetAddress /10.15.38.55 is now UP
>  INFO [GC inspection] 2010-04-06 04:05:08,539 GCInspector.java (line
> 110) GC for ConcurrentMarkSweep: 2375 ms, 39534920 reclaimed leaving
> 1051620856 used; max is 1211170816
> ...
>
> Finally followed by this and some/all nodes going down:
>
> ERROR [COMPACTION-POOL:1] 2010-04-06 04:05:05,475
> DebuggableThreadPoolExecutor.java (line 94) Error in executor
> futuretask
> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError:
> Java heap space
>        at java.util.concurrent.FutureTask$Sync.innerGet(Unknown Source)
>        at java.util.concurrent.FutureTask.get(Unknown Source)
>        at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)
>        at org.apache.cassandra.db.CompactionManager$CompactionExecutor.afterExecute(CompactionManager.java:582)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source)
>        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>        at java.lang.Thread.run(Unknown Source)
> Caused by: java.lang.OutOfMemoryError: Java heap space
>        at java.util.Arrays.copyOf(Unknown Source)
>        at java.io.ByteArrayOutputStream.write(Unknown Source)
>        at java.io.DataOutputStream.write(Unknown Source)
>        at org.apache.cassandra.io.IteratingRow.echoData(IteratingRow.java:69)
>        at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:138)
>        at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:1)
>        at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
>        at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>        at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>        at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
>        at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
>        at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:299)
>        at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:102)
>        at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:1)
>        at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source)
>        at java.util.concurrent.FutureTask.run(Unknown Source)
>        ... 3 more
>
> At first I thought that with ConsistencyLevel.ZERO I must be doing
> async writes so Cassandra can't push back on the client threads (by
> blocking them), thus the server is getting overwhelmed. But, I would
> expect it to start dropping data and not crash in that case (after
> all, I did say ZERO so I can't expect any reliability, right?).
> However, I see similar slowdown / node dropout behavior when I set the
> consistency level to ONE. Does Cassandra push back on writers under
> heavy load? Is there some magic setting I need to tune to have it not
> fall over? Do I just need a bigger cluster? Thanks in advance,
>
> -- Ilya
>
> P.S. I realize that it's still handling a LOT of data with just 4
> nodes, and in practice nobody would run a system that gets 125k writes
> per second on top of a 4 node cluster. I was just surprised that I
> could make Cassandra fall over at all using a single client that's
> pumping data at 40-50 MB/s.
>
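
For what it's worth, the reader throttling described above boils down
to a bounded producer/consumer queue plus a periodic sleep. A
simplified sketch of that shape (all names here are illustrative, and
the actual Hector writes are omitted):

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class ThrottledLoader {
        private static final int WRITER_THREADS = 32;
        private static final long SLEEP_EVERY = 1000000L;  // records between pauses
        private static final long SLEEP_MS = 7000L;        // the 7-second pause

        // Bounded queue of row batches: when it fills up, put() blocks the
        // reader, which (with ConsistencyLevel.ZERO writes) is effectively
        // the only backpressure on the client.
        private static final BlockingQueue<List<String>> queue =
                new ArrayBlockingQueue<List<String>>(64);

        public static void main(String[] args) throws Exception {
            for (int i = 0; i < WRITER_THREADS; i++) {
                Thread writer = new Thread(new Runnable() {
                    public void run() {
                        try {
                            while (true) {
                                writeRowToCassandra(queue.take()); // stand-in for the Hector batch insert
                            }
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    }
                });
                writer.setDaemon(true); // lets the sketch exit; real code would drain the queue first
                writer.start();
            }

            long records = 0;
            for (List<String> batch : readBatchesFromSortedFile()) {
                queue.put(batch);       // blocks if the writers fall behind
                records += batch.size();
                if (records % SLEEP_EVERY < batch.size()) {
                    Thread.sleep(SLEEP_MS); // crude throttle after every ~1M records
                }
            }
        }

        private static void writeRowToCassandra(List<String> batch) { /* Hector call omitted */ }

        private static Iterable<List<String>> readBatchesFromSortedFile() {
            return Collections.<List<String>>emptyList(); // placeholder for the real file reader
        }
    }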