Posted to user@cassandra.apache.org by Philippe <wa...@gmail.com> on 2011/10/09 02:13:44 UTC

12-node cluster mystery

Dear all,
I've just fired up our production cluster: 12 nodes, RF=3, and I've run into
something I don't understand at all. Our test cluster was 3 nodes, RF=3.
The test cluster had AMD Opteron CPUs (6x2.33) w/ 32 GB RAM, while the production
cluster has Core i5 CPUs (4x2.66) w/ 16 GB RAM.

I'm running the same import process using Hector as I did in August on the
test cluster, but this time I get a lot of timeouts like this one:
211725 [pool-3-thread-1] WARN me.prettyprint.cassandra.connection.HConnectionManager  - Exception:
me.prettyprint.hector.api.exceptions.HTimedOutException: TimedOutException()
        at me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:40)
        at me.prettyprint.cassandra.service.KeyspaceServiceImpl$1.execute(KeyspaceServiceImpl.java:97)
        at me.prettyprint.cassandra.service.KeyspaceServiceImpl$1.execute(KeyspaceServiceImpl.java:90)
        at me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:101)
        at me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:219)
        at me.prettyprint.cassandra.service.KeyspaceServiceImpl.operateWithFailover(KeyspaceServiceImpl.java:131)
        at me.prettyprint.cassandra.service.KeyspaceServiceImpl.batchMutate(KeyspaceServiceImpl.java:102)
        at me.prettyprint.cassandra.service.KeyspaceServiceImpl.batchMutate(KeyspaceServiceImpl.java:108)
        at me.prettyprint.cassandra.model.MutatorImpl$3.doInKeyspace(MutatorImpl.java:222)
        at me.prettyprint.cassandra.model.MutatorImpl$3.doInKeyspace(MutatorImpl.java:219)
        at me.prettyprint.cassandra.model.KeyspaceOperationCallback.doInKeyspaceAndMeasure(KeyspaceOperationCallback.java:20)
        at me.prettyprint.cassandra.model.ExecutingKeyspace.doExecute(ExecutingKeyspace.java:85)
        at me.prettyprint.cassandra.model.MutatorImpl.execute(MutatorImpl.java:219)
        at com.sensorly.heatmap.rollups.cassandra.CassandraRollupWithCountersDao.executeMutator(CassandraRollupWithCountersDao.java:302)
        at com.sensorly.heatmap.rollups.cassandra.LoaderCallable.loadRollup(LoaderCallable.java:112)
        at com.sensorly.heatmap.rollups.cassandra.LoaderCallable.run(LoaderCallable.java:74)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
Caused by: TimedOutException()
        at org.apache.cassandra.thrift.Cassandra$batch_mutate_result.read(Cassandra.java:19061)
        at org.apache.cassandra.thrift.Cassandra$Client.recv_batch_mutate(Cassandra.java:1035)
        at org.apache.cassandra.thrift.Cassandra$Client.batch_mutate(Cassandra.java:1009)
        at me.prettyprint.cassandra.service.KeyspaceServiceImpl$1.execute(KeyspaceServiceImpl.java:95)

I've tried lowering the number of concurrent threads to one and running the
import locally on one of the nodes, but neither helps.

   - vmstat shows nothing going on on the servers
   - the logs don't indicate anything
   - network traffic is below 1Mbit/s (I guess that's just gossip)
   - iostat shows no activity
   - nearly all of the servers' memory is free
   - tpstats shows that some mutations were dropped on a node.

I'm stumped... what could I have missed?

Thanks
PS: @aaron, Richard & co: I'm investigating your suggestions on my previous
questions and will report on my findings.

Re: 12-node cluster mystery

Posted by Jonathan Ellis <jb...@gmail.com>.
It looks like you're trying to use batches as a performance
optimization. Don't do that; it makes your load bursty.
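
One way to act on this, as a minimal sketch only: cap how many mutations go
into each request so the load arrives as a steady stream of small writes
rather than a few very large ones. The class SmallBatchWriter, its method
names and the sender callback are hypothetical and not part of Hector; in a
Hector client the callback is roughly where a Mutator would be filled and
execute() called. Any particular batch size is an assumption to be tuned,
not a figure from this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper (not part of Hector): sends a large write as many small batches. */
final class SmallBatchWriter {

    /** Splits items into consecutive sublists of at most maxBatchSize elements. */
    static <T> List<List<T>> chunk(List<T> items, int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += maxBatchSize) {
            int end = Math.min(start + maxBatchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /** Hands each small batch to the sender in order; returns the number of batches sent. */
    static <T> int writeInSmallBatches(List<T> items, int maxBatchSize, Consumer<List<T>> sender) {
        List<List<T>> batches = chunk(items, maxBatchSize);
        for (List<T> batch : batches) {
            // Assumption: the sender performs one write request per batch.
            sender.accept(batch);
        }
        return batches.size();
    }
}
```

With five pending writes and a cap of two, writeInSmallBatches makes three
calls to the sender, with two, two and one items respectively.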

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com