Posted to user@cassandra.apache.org by Chris Were <ch...@gmail.com> on 2009/11/10 05:25:14 UTC

Timeout Exception

I'm getting a Timeout Exception every now and again (currently every couple
of minutes or so).

Using revision 833288. Quorum set to ONE. My cassandra instance has been
running for two days and the data directory is around 16GB. I'm not sure
what the problem is, but let me know of any tests I can do to help reduce
the problem further. There are two variations on the exception, I have
pasted them both below.

ERROR [pool-1-thread-63] 2009-11-09 20:17:27,579 Cassandra.java (line org.apache.cassandra.service.Cassandra$Processor) Internal error processing get_slice
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Operation timed out - received only 0 responses from  .
        at org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:103)
        at org.apache.cassandra.service.CassandraServer.getSlice(CassandraServer.java:177)
        at org.apache.cassandra.service.CassandraServer.multigetSliceInternal(CassandraServer.java:252)
        at org.apache.cassandra.service.CassandraServer.get_slice(CassandraServer.java:215)
        at org.apache.cassandra.service.Cassandra$Processor$get_slice.process(Cassandra.java:668)
        at org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:624)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:636)
Caused by: java.util.concurrent.TimeoutException: Operation timed out - received only 0 responses from  .
        at org.apache.cassandra.service.QuorumResponseHandler.get(QuorumResponseHandler.java:79)
        at org.apache.cassandra.service.StorageProxy.strongRead(StorageProxy.java:408)
        at org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:333)
        at org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:95)
        ... 9 more

ERROR [pool-1-thread-19] 2009-11-09 11:29:18,731 Cassandra.java (line org.apache.cassandra.service.Cassandra$Processor) Internal error processing get_slice
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Operation timed out - received only 1 responses from /10.121.217.5 .
        at org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:103)
        at org.apache.cassandra.service.CassandraServer.getSlice(CassandraServer.java:177)
        at org.apache.cassandra.service.CassandraServer.multigetSliceInternal(CassandraServer.java:252)
        at org.apache.cassandra.service.CassandraServer.get_slice(CassandraServer.java:215)
        at org.apache.cassandra.service.Cassandra$Processor$get_slice.process(Cassandra.java:668)
        at org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:624)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:636)
Caused by: java.util.concurrent.TimeoutException: Operation timed out - received only 1 responses from /10.121.217.5 .
        at org.apache.cassandra.service.QuorumResponseHandler.get(QuorumResponseHandler.java:79)
        at org.apache.cassandra.service.StorageProxy.strongRead(StorageProxy.java:408)
        at org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:333)
        at org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:95)
        ... 9 more

Cheers,
Chris

Re: Timeout Exception

Posted by Richard grossman <ri...@gmail.com>.
For me, the timeout occurs when there are a lot of requests hitting the server. It
was running at 100% CPU, and it seems that Cassandra was receiving requests and
keeping them queued for more than 5000 ms, after which all of the requests were
rejected with a timeout.

I can't think of a solution other than increasing the timeout (not
ideal) or adding more Cassandra instances to answer a high volume of requests
faster.


On Tue, Nov 10, 2009 at 7:22 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> What's causing the timeout?  An error on the source node, or just
> slowness?  If the latter, how many rows are in your multiget?
>
> On Mon, Nov 9, 2009 at 10:25 PM, Chris Were <ch...@gmail.com> wrote:
> >
> > I'm getting a Timeout Exception every now and again (currently every
> couple
> > of minutes or so).
> > Using revision 833288. Quorum set to ONE. My cassandra instance has been
> > running for two days and the data directory is around 16GB. I'm not sure
> > what the problem is, but let me know of any tests I can do to help reduce
> > the problem further. There are two variations on the exception, I have
> > pasted them both below.
> > ERROR [pool-1-thread-63] 2009-11-09 20:17:27,579 Cassandra.java (line
> > org.apache.cassandra.service.Cassandra$Processor) Internal error
> processing
> > get_slice
> > java.lang.RuntimeException: java.util.concurrent.TimeoutException:
> Operation
> > timed out - received only 0 responses from  .
> > at
> >
> org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:103)
> > at
> >
> org.apache.cassandra.service.CassandraServer.getSlice(CassandraServer.java:177)
> > at
> >
> org.apache.cassandra.service.CassandraServer.multigetSliceInternal(CassandraServer.java:252)
> > at
> >
> org.apache.cassandra.service.CassandraServer.get_slice(CassandraServer.java:215)
> > at
> >
> org.apache.cassandra.service.Cassandra$Processor$get_slice.process(Cassandra.java:668)
> > at
> >
> org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:624)
> > at
> >
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> > at java.lang.Thread.run(Thread.java:636)
> > Caused by: java.util.concurrent.TimeoutException: Operation timed out -
> > received only 0 responses from  .
> > at
> >
> org.apache.cassandra.service.QuorumResponseHandler.get(QuorumResponseHandler.java:79)
> > at
> >
> org.apache.cassandra.service.StorageProxy.strongRead(StorageProxy.java:408)
> > at
> >
> org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:333)
> > at
> >
> org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:95)
> > ... 9 more
> > ERROR [pool-1-thread-19] 2009-11-09 11:29:18,731 Cassandra.java (line
> > org.apache.cassandra.service.Cassandra$Processor) Internal error
> processing
> > get_slice
> > java.lang.RuntimeException: java.util.concurrent.TimeoutException:
> Operation
> > timed out - received only 1 responses from /10.121.217.5 .
> >         at
> >
> org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:103)
> >         at
> >
> org.apache.cassandra.service.CassandraServer.getSlice(CassandraServer.java:177)
> >         at
> >
> org.apache.cassandra.service.CassandraServer.multigetSliceInternal(CassandraServer.java:252)
> >         at
> >
> org.apache.cassandra.service.CassandraServer.get_slice(CassandraServer.java:215)
> >         at
> >
> org.apache.cassandra.service.Cassandra$Processor$get_slice.process(Cassandra.java:668)
> >         at
> >
> org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:624)
> >         at
> >
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
> >         at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> >         at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> >         at java.lang.Thread.run(Thread.java:636)
> > Caused by: java.util.concurrent.TimeoutException: Operation timed out -
> > received only 1 responses from /10.121.217.5 .
> >         at
> >
> org.apache.cassandra.service.QuorumResponseHandler.get(QuorumResponseHandler.java:79)
> >         at
> >
> org.apache.cassandra.service.StorageProxy.strongRead(StorageProxy.java:408)
> >         at
> >
> org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:333)
> >         at
> >
> org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:95)
> >         ... 9 more
> > Cheers,
> > Chris
>

Re: Timeout Exception

Posted by Jonathan Ellis <jb...@gmail.com>.
I suspect there may be a connection between "the server was busy with
something and didn't die immediately with kill -INT" and "I was
getting timeout exceptions."

On Tue, Nov 10, 2009 at 2:42 PM, Chris Were <ch...@gmail.com> wrote:
> As in... kill -9
>
> On Tue, Nov 10, 2009 at 12:38 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> it's supposed to be kill-only.  curious what shutdown you were trying.
>>
>> On Tue, Nov 10, 2009 at 2:19 PM, Chris Were <ch...@gmail.com> wrote:
>> > I've restarted with debugging and it seems to be ok for the time being.
>> > Interesting to note that cassandra wouldn't shut down properly and had
>> > to be
>> > killed.
>> >
>> > On Tue, Nov 10, 2009 at 11:55 AM, Jonathan Ellis <jb...@gmail.com>
>> > wrote:
>> >>
>> >> if you're timing out doing a slice on 10 columns w/ 10% cpu used,
>> >> something is broken
>> >>
>> >> is it consistent as to which keys this happens on?  try turning on
>> >> debug logging and seeing where the latency is coming from.
>> >>
>> >> On Tue, Nov 10, 2009 at 1:53 PM, Chris Were <ch...@gmail.com>
>> >> wrote:
>> >> >
>> >> > On Tue, Nov 10, 2009 at 11:50 AM, Jonathan Ellis <jb...@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> On Tue, Nov 10, 2009 at 1:49 PM, Chris Were <ch...@gmail.com>
>> >> >> wrote:
>> >> >> > Maybe... but it's not just multigets, it also happens when
>> >> >> > retreiving
>> >> >> > one
>> >> >> > row with get_slice.
>> >> >>
>> >> >> how many of the 3M columns are you trying to slice at once?
>> >> >
>> >> > Sorry, I must have mixed up the terminology.
>> >> > There's ~3M keys, but less than 10 columns in each. The get_slice
>> >> > calls
>> >> > are
>> >> > to retreive all the columns (10) for a given key.
>> >
>> >
>
>

Re: Timeout Exception

Posted by Chris Were <ch...@gmail.com>.
I've restarted with debugging and it seems to be ok for the time being.
Interesting to note that cassandra wouldn't shut down properly and had to be
killed.

On Tue, Nov 10, 2009 at 11:55 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> if you're timing out doing a slice on 10 columns w/ 10% cpu used,
> something is broken
>
> is it consistent as to which keys this happens on?  try turning on
> debug logging and seeing where the latency is coming from.
>
> On Tue, Nov 10, 2009 at 1:53 PM, Chris Were <ch...@gmail.com> wrote:
> >
> > On Tue, Nov 10, 2009 at 11:50 AM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> >>
> >> On Tue, Nov 10, 2009 at 1:49 PM, Chris Were <ch...@gmail.com>
> wrote:
> >> > Maybe... but it's not just multigets, it also happens when retreiving
> >> > one
> >> > row with get_slice.
> >>
> >> how many of the 3M columns are you trying to slice at once?
> >
> > Sorry, I must have mixed up the terminology.
> > There's ~3M keys, but less than 10 columns in each. The get_slice calls
> are
> > to retreive all the columns (10) for a given key.
>

Re: Timeout Exception

Posted by Igor Katkov <ik...@gmail.com>.
On most reasonable hardware (for Cassandra) JVM will be running in server
mode by default.
http://java.sun.com/j2se/1.5.0/docs/guide/vm/server-class.html
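
If in doubt, a quick way to confirm which VM a given java binary and set of flags
actually starts is to print java.vm.name. A minimal sketch (the exact VM name strings
vary by vendor and JVM version, so treat the output as a hint rather than a contract):

    public class WhichVm {
        public static void main(String[] args) {
            // HotSpot typically reports "Java HotSpot(TM) Server VM" or "... Client VM".
            System.out.println(System.getProperty("java.vm.name"));
            System.out.println(System.getProperty("java.vm.version"));
        }
    }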

On Mon, Nov 16, 2009 at 9:12 PM, Chris Were <ch...@gmail.com> wrote:

> Reading more on JVM GC led me to investigate the java -server flag (
> http://stackoverflow.com/questions/198577/real-differences-between-java-server-and-java-client
> )
>
> From what I can see cassandra's startup scripts don't invoke this mode, or
> did I miss it?
>
> Chris.
>
>
> On Mon, Nov 16, 2009 at 10:33 AM, Freeman, Tim <ti...@hp.com> wrote:
>
>>  You'll have to stop the swapping somehow.  Maybe you can install more
>> memory, maybe you can run Cassandra smaller, maybe you can get some other
>> process on the machine to be smaller or on some other machine, maybe you can
>> move Cassandra to some other machine with more available physical memory.
>>
>>
>>
>> I don't have experience with running Cassandra smaller than the
>> recommended size, so one of those options might not work.
>>
>>
>>
>> Caching database information in swapped-out pages usually isn't a win.  To
>> a first approximation, you need an I/O to fetch the swapped-out page, but
>> you'd need an I/O anyway to get the information from the database.  Swapping
>> on modern machines usually isn't a win in general -- Memory got bigger and
>> CPU's got faster in the last decade, but disks didn't get much faster.
>>
>>
>>
>> Tim Freeman
>> Email: tim.freeman@hp.com
>> Desk in Palo Alto: (650) 857-2581
>> Home: (408) 774-1298
>> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and
>> Thursday; call my desk instead.)
>>
>>
>>
>> *From:* Chris Were [mailto:chris.were@gmail.com]
>> *Sent:* Monday, November 16, 2009 10:13 AM
>> *To:* cassandra-user@incubator.apache.org
>> *Subject:* Re: Timeout Exception
>>
>>
>>
>> Hi Tim,
>>
>>
>>
>> Thanks for the great pointers.
>>
>>
>>
>> si, so are regularly in the 100-2000 range. I'll need to Google more about
>> what these mean etc, but are you effectively saying to tell cassandra to use
>> less memory? Cassandra is the only Java App running on the server.
>>
>>
>>
>> Cheers,
>>
>> Chris
>>
>> On Mon, Nov 16, 2009 at 9:59 AM, Freeman, Tim <ti...@hp.com> wrote:
>>
>> I'm running 0.4.1.  I used to get timeouts, then I changed my timeout from
>> 5 seconds to 30 seconds and I get no more timeouts.  The relevant line from
>> storage-conf.xml is:
>>
>>
>>
>>   <RpcTimeoutInMillis>30000</RpcTimeoutInMillis>
>>
>>
>>
>> The maximum latency is often just over 5 seconds in the worst case when I
>> fetch thousands of records, so default timeout of 5 seconds happens to be a
>> little bit too low for me.  My records are ~100Kbytes each.  You may get
>> different results if your records are much larger or much smaller.
>>
>>
>>
>> The other issue I was having a few days ago was that the machine was page
>> faulting so garbage collections were taking forever.  Some GC's took 20
>> minutes in another Java process.  I didn't have verbose:gc turned on in
>> Cassandra so I'm not sure what the score was there, but there's little
>> reason to expect it to be qualitatively better, since it's pretty random
>> which process gets some of its pages swapped out.  On a Linux machine, run
>> "vmstat 5" when your machine is loaded and if you see numbers greater than 0
>> in the "si" and "so" columns in rows after the first, tell one of your Java
>> processes to take less memory.
>>
>>
>>
>> Tim Freeman
>> Email: tim.freeman@hp.com
>> Desk in Palo Alto: (650) 857-2581
>> Home: (408) 774-1298
>> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and
>> Thursday; call my desk instead.)
>>
>>
>>
>> *From:* Chris Were [mailto:chris.were@gmail.com]
>> *Sent:* Monday, November 16, 2009 9:47 AM
>> *To:* Jonathan Ellis
>> *Cc:* cassandra-user@incubator.apache.org
>> *Subject:* Re: Timeout Exception
>>
>>
>>
>> I turned on debug logging for a few days and timeouts happened across
>> pretty much all requests. I couldn't see any particular request that was
>> consistently the problem.
>>
>>
>>
>> After some experimenting it seems that shutting down cassandra and
>> restarting resolves the problem. Once it hits the JVM memory limit however,
>> the timeouts start again. I have read the page on MemTable thresholds and
>> have tried thresholds of 32MB, 64MB and 128MB with no noticeable difference.
>> Cassandra is set to use 7GB of memory. I have 12 CF's, however only 6 of
>> those have lots of data.
>>
>>
>>
>> Cheers,
>>
>> Chris
>>
>> On Tue, Nov 10, 2009 at 11:55 AM, Jonathan Ellis <jb...@gmail.com>
>> wrote:
>>
>> if you're timing out doing a slice on 10 columns w/ 10% cpu used,
>> something is broken
>>
>> is it consistent as to which keys this happens on?  try turning on
>> debug logging and seeing where the latency is coming from.
>>
>>
>> On Tue, Nov 10, 2009 at 1:53 PM, Chris Were <ch...@gmail.com> wrote:
>> >
>> > On Tue, Nov 10, 2009 at 11:50 AM, Jonathan Ellis <jb...@gmail.com>
>> wrote:
>> >>
>> >> On Tue, Nov 10, 2009 at 1:49 PM, Chris Were <ch...@gmail.com>
>> wrote:
>> >> > Maybe... but it's not just multigets, it also happens when retreiving
>> >> > one
>> >> > row with get_slice.
>> >>
>> >> how many of the 3M columns are you trying to slice at once?
>> >
>> > Sorry, I must have mixed up the terminology.
>> > There's ~3M keys, but less than 10 columns in each. The get_slice calls
>> are
>> > to retreive all the columns (10) for a given key.
>>
>>
>>
>>
>>
>
>

Re: Timeout Exception

Posted by Chris Were <ch...@gmail.com>.
Reading more on JVM GC led me to investigate the java -server flag (
http://stackoverflow.com/questions/198577/real-differences-between-java-server-and-java-client
)

From what I can see cassandra's startup scripts don't invoke this mode, or
did I miss it?

Chris.

On Mon, Nov 16, 2009 at 10:33 AM, Freeman, Tim <ti...@hp.com> wrote:

>  You'll have to stop the swapping somehow.  Maybe you can install more
> memory, maybe you can run Cassandra smaller, maybe you can get some other
> process on the machine to be smaller or on some other machine, maybe you can
> move Cassandra to some other machine with more available physical memory.
>
>
>
> I don't have experience with running Cassandra smaller than the recommended
> size, so one of those options might not work.
>
>
>
> Caching database information in swapped-out pages usually isn't a win.  To
> a first approximation, you need an I/O to fetch the swapped-out page, but
> you'd need an I/O anyway to get the information from the database.  Swapping
> on modern machines usually isn't a win in general -- Memory got bigger and
> CPU's got faster in the last decade, but disks didn't get much faster.
>
>
>
> Tim Freeman
> Email: tim.freeman@hp.com
> Desk in Palo Alto: (650) 857-2581
> Home: (408) 774-1298
> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and
> Thursday; call my desk instead.)
>
>
>
> *From:* Chris Were [mailto:chris.were@gmail.com]
> *Sent:* Monday, November 16, 2009 10:13 AM
> *To:* cassandra-user@incubator.apache.org
> *Subject:* Re: Timeout Exception
>
>
>
> Hi Tim,
>
>
>
> Thanks for the great pointers.
>
>
>
> si, so are regularly in the 100-2000 range. I'll need to Google more about
> what these mean etc, but are you effectively saying to tell cassandra to use
> less memory? Cassandra is the only Java App running on the server.
>
>
>
> Cheers,
>
> Chris
>
> On Mon, Nov 16, 2009 at 9:59 AM, Freeman, Tim <ti...@hp.com> wrote:
>
> I'm running 0.4.1.  I used to get timeouts, then I changed my timeout from
> 5 seconds to 30 seconds and I get no more timeouts.  The relevant line from
> storage-conf.xml is:
>
>
>
>   <RpcTimeoutInMillis>30000</RpcTimeoutInMillis>
>
>
>
> The maximum latency is often just over 5 seconds in the worst case when I
> fetch thousands of records, so default timeout of 5 seconds happens to be a
> little bit too low for me.  My records are ~100Kbytes each.  You may get
> different results if your records are much larger or much smaller.
>
>
>
> The other issue I was having a few days ago was that the machine was page
> faulting so garbage collections were taking forever.  Some GC's took 20
> minutes in another Java process.  I didn't have verbose:gc turned on in
> Cassandra so I'm not sure what the score was there, but there's little
> reason to expect it to be qualitatively better, since it's pretty random
> which process gets some of its pages swapped out.  On a Linux machine, run
> "vmstat 5" when your machine is loaded and if you see numbers greater than 0
> in the "si" and "so" columns in rows after the first, tell one of your Java
> processes to take less memory.
>
>
>
> Tim Freeman
> Email: tim.freeman@hp.com
> Desk in Palo Alto: (650) 857-2581
> Home: (408) 774-1298
> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and
> Thursday; call my desk instead.)
>
>
>
> *From:* Chris Were [mailto:chris.were@gmail.com]
> *Sent:* Monday, November 16, 2009 9:47 AM
> *To:* Jonathan Ellis
> *Cc:* cassandra-user@incubator.apache.org
> *Subject:* Re: Timeout Exception
>
>
>
> I turned on debug logging for a few days and timeouts happened across
> pretty much all requests. I couldn't see any particular request that was
> consistently the problem.
>
>
>
> After some experimenting it seems that shutting down cassandra and
> restarting resolves the problem. Once it hits the JVM memory limit however,
> the timeouts start again. I have read the page on MemTable thresholds and
> have tried thresholds of 32MB, 64MB and 128MB with no noticeable difference.
> Cassandra is set to use 7GB of memory. I have 12 CF's, however only 6 of
> those have lots of data.
>
>
>
> Cheers,
>
> Chris
>
> On Tue, Nov 10, 2009 at 11:55 AM, Jonathan Ellis <jb...@gmail.com>
> wrote:
>
> if you're timing out doing a slice on 10 columns w/ 10% cpu used,
> something is broken
>
> is it consistent as to which keys this happens on?  try turning on
> debug logging and seeing where the latency is coming from.
>
>
> On Tue, Nov 10, 2009 at 1:53 PM, Chris Were <ch...@gmail.com> wrote:
> >
> > On Tue, Nov 10, 2009 at 11:50 AM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> >>
> >> On Tue, Nov 10, 2009 at 1:49 PM, Chris Were <ch...@gmail.com>
> wrote:
> >> > Maybe... but it's not just multigets, it also happens when retreiving
> >> > one
> >> > row with get_slice.
> >>
> >> how many of the 3M columns are you trying to slice at once?
> >
> > Sorry, I must have mixed up the terminology.
> > There's ~3M keys, but less than 10 columns in each. The get_slice calls
> are
> > to retreive all the columns (10) for a given key.
>
>
>
>
>

RE: Timeout Exception

Posted by "Freeman, Tim" <ti...@hp.com>.
You'll have to stop the swapping somehow.  Maybe you can install more memory, maybe you can run Cassandra smaller, maybe you can get some other process on the machine to be smaller or on some other machine, maybe you can move Cassandra to some other machine with more available physical memory.

I don't have experience with running Cassandra smaller than the recommended size, so one of those options might not work.

Caching database information in swapped-out pages usually isn't a win.  To a first approximation, you need an I/O to fetch the swapped-out page, but you'd need an I/O anyway to get the information from the database.  Swapping on modern machines usually isn't a win in general -- Memory got bigger and CPU's got faster in the last decade, but disks didn't get much faster.

Tim Freeman
Email: tim.freeman@hp.com<ma...@hp.com>
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)

From: Chris Were [mailto:chris.were@gmail.com]
Sent: Monday, November 16, 2009 10:13 AM
To: cassandra-user@incubator.apache.org
Subject: Re: Timeout Exception

Hi Tim,

Thanks for the great pointers.

si, so are regularly in the 100-2000 range. I'll need to Google more about what these mean etc, but are you effectively saying to tell cassandra to use less memory? Cassandra is the only Java App running on the server.

Cheers,
Chris
On Mon, Nov 16, 2009 at 9:59 AM, Freeman, Tim <ti...@hp.com>> wrote:
I'm running 0.4.1.  I used to get timeouts, then I changed my timeout from 5 seconds to 30 seconds and I get no more timeouts.  The relevant line from storage-conf.xml is:

  <RpcTimeoutInMillis>30000</RpcTimeoutInMillis>

The maximum latency is often just over 5 seconds in the worst case when I fetch thousands of records, so default timeout of 5 seconds happens to be a little bit too low for me.  My records are ~100Kbytes each.  You may get different results if your records are much larger or much smaller.

The other issue I was having a few days ago was that the machine was page faulting so garbage collections were taking forever.  Some GC's took 20 minutes in another Java process.  I didn't have verbose:gc turned on in Cassandra so I'm not sure what the score was there, but there's little reason to expect it to be qualitatively better, since it's pretty random which process gets some of its pages swapped out.  On a Linux machine, run "vmstat 5" when your machine is loaded and if you see numbers greater than 0 in the "si" and "so" columns in rows after the first, tell one of your Java processes to take less memory.

Tim Freeman
Email: tim.freeman@hp.com<ma...@hp.com>
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)

From: Chris Were [mailto:chris.were@gmail.com<ma...@gmail.com>]
Sent: Monday, November 16, 2009 9:47 AM
To: Jonathan Ellis
Cc: cassandra-user@incubator.apache.org<ma...@incubator.apache.org>
Subject: Re: Timeout Exception

I turned on debug logging for a few days and timeouts happened across pretty much all requests. I couldn't see any particular request that was consistently the problem.

After some experimenting it seems that shutting down cassandra and restarting resolves the problem. Once it hits the JVM memory limit however, the timeouts start again. I have read the page on MemTable thresholds and have tried thresholds of 32MB, 64MB and 128MB with no noticeable difference. Cassandra is set to use 7GB of memory. I have 12 CF's, however only 6 of those have lots of data.

Cheers,
Chris
On Tue, Nov 10, 2009 at 11:55 AM, Jonathan Ellis <jb...@gmail.com>> wrote:
if you're timing out doing a slice on 10 columns w/ 10% cpu used,
something is broken

is it consistent as to which keys this happens on?  try turning on
debug logging and seeing where the latency is coming from.

On Tue, Nov 10, 2009 at 1:53 PM, Chris Were <ch...@gmail.com>> wrote:
>
> On Tue, Nov 10, 2009 at 11:50 AM, Jonathan Ellis <jb...@gmail.com>> wrote:
>>
>> On Tue, Nov 10, 2009 at 1:49 PM, Chris Were <ch...@gmail.com>> wrote:
>> > Maybe... but it's not just multigets, it also happens when retreiving
>> > one
>> > row with get_slice.
>>
>> how many of the 3M columns are you trying to slice at once?
>
> Sorry, I must have mixed up the terminology.
> There's ~3M keys, but less than 10 columns in each. The get_slice calls are
> to retreive all the columns (10) for a given key.



Re: Timeout Exception

Posted by Jonathan Ellis <jb...@gmail.com>.
Yes, using a larger heap than you need can be bad from a GC latency
point of view.

Upgrading to 0.4.2 will also help since we have better default GC options.
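
If you want numbers rather than guesses about GC pauses, the collector MXBeans expose
cumulative collection counts and times. A minimal sketch is below; it reports on the JVM
it runs in, so you would either run it inside the same process or read the same beans over
a JMX connection to the Cassandra node, and the 5-second poll interval is arbitrary:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcWatch {
        public static void main(String[] args) throws InterruptedException {
            while (true) {
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    // Cumulative collection count and total pause time (ms) since JVM start.
                    System.out.printf("%s: count=%d time=%dms%n",
                            gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                }
                Thread.sleep(5000);
            }
        }
    }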

On Mon, Nov 16, 2009 at 12:13 PM, Chris Were <ch...@gmail.com> wrote:
> Hi Tim,
> Thanks for the great pointers.
> si, so are regularly in the 100-2000 range. I'll need to Google more about
> what these mean etc, but are you effectively saying to tell cassandra to use
> less memory? Cassandra is the only Java App running on the server.
> Cheers,
> Chris
>
> On Mon, Nov 16, 2009 at 9:59 AM, Freeman, Tim <ti...@hp.com> wrote:
>>
>> I'm running 0.4.1.  I used to get timeouts, then I changed my timeout from
>> 5 seconds to 30 seconds and I get no more timeouts.  The relevant line from
>> storage-conf.xml is:
>>
>>
>>
>>   <RpcTimeoutInMillis>30000</RpcTimeoutInMillis>
>>
>>
>>
>> The maximum latency is often just over 5 seconds in the worst case when I
>> fetch thousands of records, so default timeout of 5 seconds happens to be a
>> little bit too low for me.  My records are ~100Kbytes each.  You may get
>> different results if your records are much larger or much smaller.
>>
>>
>>
>> The other issue I was having a few days ago was that the machine was page
>> faulting so garbage collections were taking forever.  Some GC's took 20
>> minutes in another Java process.  I didn't have verbose:gc turned on in
>> Cassandra so I'm not sure what the score was there, but there's little
>> reason to expect it to be qualitatively better, since it's pretty random
>> which process gets some of its pages swapped out.  On a Linux machine, run
>> "vmstat 5" when your machine is loaded and if you see numbers greater than 0
>> in the "si" and "so" columns in rows after the first, tell one of your Java
>> processes to take less memory.
>>
>>
>>
>> Tim Freeman
>> Email: tim.freeman@hp.com
>> Desk in Palo Alto: (650) 857-2581
>> Home: (408) 774-1298
>> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and
>> Thursday; call my desk instead.)
>>
>>
>>
>> From: Chris Were [mailto:chris.were@gmail.com]
>> Sent: Monday, November 16, 2009 9:47 AM
>> To: Jonathan Ellis
>> Cc: cassandra-user@incubator.apache.org
>> Subject: Re: Timeout Exception
>>
>>
>>
>> I turned on debug logging for a few days and timeouts happened across
>> pretty much all requests. I couldn't see any particular request that was
>> consistently the problem.
>>
>>
>>
>> After some experimenting it seems that shutting down cassandra and
>> restarting resolves the problem. Once it hits the JVM memory limit however,
>> the timeouts start again. I have read the page on MemTable thresholds and
>> have tried thresholds of 32MB, 64MB and 128MB with no noticeable difference.
>> Cassandra is set to use 7GB of memory. I have 12 CF's, however only 6 of
>> those have lots of data.
>>
>>
>>
>> Cheers,
>>
>> Chris
>>
>> On Tue, Nov 10, 2009 at 11:55 AM, Jonathan Ellis <jb...@gmail.com>
>> wrote:
>>
>> if you're timing out doing a slice on 10 columns w/ 10% cpu used,
>> something is broken
>>
>> is it consistent as to which keys this happens on?  try turning on
>> debug logging and seeing where the latency is coming from.
>>
>> On Tue, Nov 10, 2009 at 1:53 PM, Chris Were <ch...@gmail.com> wrote:
>> >
>> > On Tue, Nov 10, 2009 at 11:50 AM, Jonathan Ellis <jb...@gmail.com>
>> > wrote:
>> >>
>> >> On Tue, Nov 10, 2009 at 1:49 PM, Chris Were <ch...@gmail.com>
>> >> wrote:
>> >> > Maybe... but it's not just multigets, it also happens when retreiving
>> >> > one
>> >> > row with get_slice.
>> >>
>> >> how many of the 3M columns are you trying to slice at once?
>> >
>> > Sorry, I must have mixed up the terminology.
>> > There's ~3M keys, but less than 10 columns in each. The get_slice calls
>> > are
>> > to retreive all the columns (10) for a given key.
>>
>>
>

Re: Timeout Exception

Posted by Chris Were <ch...@gmail.com>.
Hi Tim,

Thanks for the great pointers.

si, so are regularly in the 100-2000 range. I'll need to Google more about
what these mean etc, but are you effectively saying to tell cassandra to use
less memory? Cassandra is the only Java App running on the server.

Cheers,
Chris

On Mon, Nov 16, 2009 at 9:59 AM, Freeman, Tim <ti...@hp.com> wrote:

>  I'm running 0.4.1.  I used to get timeouts, then I changed my timeout
> from 5 seconds to 30 seconds and I get no more timeouts.  The relevant line
> from storage-conf.xml is:
>
>
>
>   <RpcTimeoutInMillis>30000</RpcTimeoutInMillis>
>
>
>
> The maximum latency is often just over 5 seconds in the worst case when I
> fetch thousands of records, so default timeout of 5 seconds happens to be a
> little bit too low for me.  My records are ~100Kbytes each.  You may get
> different results if your records are much larger or much smaller.
>
>
>
> The other issue I was having a few days ago was that the machine was page
> faulting so garbage collections were taking forever.  Some GC's took 20
> minutes in another Java process.  I didn't have verbose:gc turned on in
> Cassandra so I'm not sure what the score was there, but there's little
> reason to expect it to be qualitatively better, since it's pretty random
> which process gets some of its pages swapped out.  On a Linux machine, run
> "vmstat 5" when your machine is loaded and if you see numbers greater than 0
> in the "si" and "so" columns in rows after the first, tell one of your Java
> processes to take less memory.
>
>
>
> Tim Freeman
> Email: tim.freeman@hp.com
> Desk in Palo Alto: (650) 857-2581
> Home: (408) 774-1298
> Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and
> Thursday; call my desk instead.)
>
>
>
> *From:* Chris Were [mailto:chris.were@gmail.com]
> *Sent:* Monday, November 16, 2009 9:47 AM
> *To:* Jonathan Ellis
> *Cc:* cassandra-user@incubator.apache.org
> *Subject:* Re: Timeout Exception
>
>
>
> I turned on debug logging for a few days and timeouts happened across
> pretty much all requests. I couldn't see any particular request that was
> consistently the problem.
>
>
>
> After some experimenting it seems that shutting down cassandra and
> restarting resolves the problem. Once it hits the JVM memory limit however,
> the timeouts start again. I have read the page on MemTable thresholds and
> have tried thresholds of 32MB, 64MB and 128MB with no noticeable difference.
> Cassandra is set to use 7GB of memory. I have 12 CF's, however only 6 of
> those have lots of data.
>
>
>
> Cheers,
>
> Chris
>
> On Tue, Nov 10, 2009 at 11:55 AM, Jonathan Ellis <jb...@gmail.com>
> wrote:
>
> if you're timing out doing a slice on 10 columns w/ 10% cpu used,
> something is broken
>
> is it consistent as to which keys this happens on?  try turning on
> debug logging and seeing where the latency is coming from.
>
>
> On Tue, Nov 10, 2009 at 1:53 PM, Chris Were <ch...@gmail.com> wrote:
> >
> > On Tue, Nov 10, 2009 at 11:50 AM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> >>
> >> On Tue, Nov 10, 2009 at 1:49 PM, Chris Were <ch...@gmail.com>
> wrote:
> >> > Maybe... but it's not just multigets, it also happens when retreiving
> >> > one
> >> > row with get_slice.
> >>
> >> how many of the 3M columns are you trying to slice at once?
> >
> > Sorry, I must have mixed up the terminology.
> > There's ~3M keys, but less than 10 columns in each. The get_slice calls
> are
> > to retreive all the columns (10) for a given key.
>
>
>

RE: Timeout Exception

Posted by "Freeman, Tim" <ti...@hp.com>.
I'm running 0.4.1.  I used to get timeouts, then I changed my timeout from 5 seconds to 30 seconds and I get no more timeouts.  The relevant line from storage-conf.xml is:

  <RpcTimeoutInMillis>30000</RpcTimeoutInMillis>

The maximum latency is often just over 5 seconds in the worst case when I fetch thousands of records, so default timeout of 5 seconds happens to be a little bit too low for me.  My records are ~100Kbytes each.  You may get different results if your records are much larger or much smaller.
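
A client-side companion to that setting: the Thrift socket timeout should be at least as
large as RpcTimeoutInMillis, or the client can give up before the server does. A minimal
sketch with the raw Thrift transport (the host is a placeholder; 9160 is Cassandra's
default Thrift port):

    import org.apache.thrift.transport.TSocket;
    import org.apache.thrift.transport.TTransportException;

    public class ClientTimeout {
        public static void main(String[] args) throws TTransportException {
            TSocket socket = new TSocket("cassandra-host", 9160);
            // Match (or exceed) the server's RpcTimeoutInMillis; 30000 ms here
            // mirrors the storage-conf.xml value above.
            socket.setTimeout(30000);
            socket.open();
            // ... build a TBinaryProtocol and Cassandra.Client on top of this transport ...
            socket.close();
        }
    }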

The other issue I was having a few days ago was that the machine was page faulting so garbage collections were taking forever.  Some GC's took 20 minutes in another Java process.  I didn't have verbose:gc turned on in Cassandra so I'm not sure what the score was there, but there's little reason to expect it to be qualitatively better, since it's pretty random which process gets some of its pages swapped out.  On a Linux machine, run "vmstat 5" when your machine is loaded and if you see numbers greater than 0 in the "si" and "so" columns in rows after the first, tell one of your Java processes to take less memory.

Tim Freeman
Email: tim.freeman@hp.com<ma...@hp.com>
Desk in Palo Alto: (650) 857-2581
Home: (408) 774-1298
Cell: (408) 348-7536 (No reception business hours Monday, Tuesday, and Thursday; call my desk instead.)

From: Chris Were [mailto:chris.were@gmail.com]
Sent: Monday, November 16, 2009 9:47 AM
To: Jonathan Ellis
Cc: cassandra-user@incubator.apache.org
Subject: Re: Timeout Exception

I turned on debug logging for a few days and timeouts happened across pretty much all requests. I couldn't see any particular request that was consistently the problem.

After some experimenting it seems that shutting down cassandra and restarting resolves the problem. Once it hits the JVM memory limit however, the timeouts start again. I have read the page on MemTable thresholds and have tried thresholds of 32MB, 64MB and 128MB with no noticeable difference. Cassandra is set to use 7GB of memory. I have 12 CF's, however only 6 of those have lots of data.

Cheers,
Chris
On Tue, Nov 10, 2009 at 11:55 AM, Jonathan Ellis <jb...@gmail.com>> wrote:
if you're timing out doing a slice on 10 columns w/ 10% cpu used,
something is broken

is it consistent as to which keys this happens on?  try turning on
debug logging and seeing where the latency is coming from.

On Tue, Nov 10, 2009 at 1:53 PM, Chris Were <ch...@gmail.com>> wrote:
>
> On Tue, Nov 10, 2009 at 11:50 AM, Jonathan Ellis <jb...@gmail.com>> wrote:
>>
>> On Tue, Nov 10, 2009 at 1:49 PM, Chris Were <ch...@gmail.com>> wrote:
>> > Maybe... but it's not just multigets, it also happens when retreiving
>> > one
>> > row with get_slice.
>>
>> how many of the 3M columns are you trying to slice at once?
>
> Sorry, I must have mixed up the terminology.
> There's ~3M keys, but less than 10 columns in each. The get_slice calls are
> to retreive all the columns (10) for a given key.


Re: Timeout Exception

Posted by Chris Were <ch...@gmail.com>.
I turned on debug logging for a few days and timeouts happened across pretty
much all requests. I couldn't see any particular request that was
consistently the problem.

After some experimenting it seems that shutting down cassandra and
restarting resolves the problem. Once it hits the JVM memory limit however,
the timeouts start again. I have read the page on MemTable thresholds and
have tried thresholds of 32MB, 64MB and 128MB with no noticeable difference.
Cassandra is set to use 7GB of memory. I have 12 CF's, however only 6 of
those have lots of data.

Cheers,
Chris

On Tue, Nov 10, 2009 at 11:55 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> if you're timing out doing a slice on 10 columns w/ 10% cpu used,
> something is broken
>
> is it consistent as to which keys this happens on?  try turning on
> debug logging and seeing where the latency is coming from.
>
> On Tue, Nov 10, 2009 at 1:53 PM, Chris Were <ch...@gmail.com> wrote:
> >
> > On Tue, Nov 10, 2009 at 11:50 AM, Jonathan Ellis <jb...@gmail.com>
> wrote:
> >>
> >> On Tue, Nov 10, 2009 at 1:49 PM, Chris Were <ch...@gmail.com>
> wrote:
> >> > Maybe... but it's not just multigets, it also happens when retreiving
> >> > one
> >> > row with get_slice.
> >>
> >> how many of the 3M columns are you trying to slice at once?
> >
> > Sorry, I must have mixed up the terminology.
> > There's ~3M keys, but less than 10 columns in each. The get_slice calls
> are
> > to retreive all the columns (10) for a given key.
>

Re: Timeout Exception

Posted by Jonathan Ellis <jb...@gmail.com>.
if you're timing out doing a slice on 10 columns w/ 10% cpu used,
something is broken

is it consistent as to which keys this happens on?  try turning on
debug logging and seeing where the latency is coming from.

On Tue, Nov 10, 2009 at 1:53 PM, Chris Were <ch...@gmail.com> wrote:
>
> On Tue, Nov 10, 2009 at 11:50 AM, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> On Tue, Nov 10, 2009 at 1:49 PM, Chris Were <ch...@gmail.com> wrote:
>> > Maybe... but it's not just multigets, it also happens when retreiving
>> > one
>> > row with get_slice.
>>
>> how many of the 3M columns are you trying to slice at once?
>
> Sorry, I must have mixed up the terminology.
> There's ~3M keys, but less than 10 columns in each. The get_slice calls are
> to retreive all the columns (10) for a given key.

Re: Timeout Exception

Posted by Chris Were <ch...@gmail.com>.
On Tue, Nov 10, 2009 at 11:50 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> On Tue, Nov 10, 2009 at 1:49 PM, Chris Were <ch...@gmail.com> wrote:
> > Maybe... but it's not just multigets, it also happens when retreiving one
> > row with get_slice.
>
> how many of the 3M columns are you trying to slice at once?
>

Sorry, I must have mixed up the terminology.

There are ~3M keys, but fewer than 10 columns in each. The get_slice calls are
to retrieve all the columns (10) for a given key.
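
For reference, a single-row get_slice of that shape against the 0.4-era Thrift interface
looks roughly like the sketch below. The host, keyspace, column family and key are
placeholders, and the generated classes are shown under org.apache.cassandra.service (as
in the stack traces above); later releases moved and reshaped them, so treat the exact
signatures as approximate:

    import java.util.List;
    import org.apache.cassandra.service.Cassandra;
    import org.apache.cassandra.service.ColumnOrSuperColumn;
    import org.apache.cassandra.service.ColumnParent;
    import org.apache.cassandra.service.ConsistencyLevel;
    import org.apache.cassandra.service.SlicePredicate;
    import org.apache.cassandra.service.SliceRange;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;

    public class GetSliceExample {
        public static void main(String[] args) throws Exception {
            TSocket socket = new TSocket("cassandra-host", 9160);  // placeholder host, default Thrift port
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket));
            socket.open();

            // Empty start/finish names plus a small count means "all columns of this
            // row, up to 10", which is the access pattern described above.
            SlicePredicate predicate = new SlicePredicate();
            predicate.slice_range = new SliceRange(new byte[0], new byte[0], false, 10);

            ColumnParent parent = new ColumnParent();
            parent.column_family = "Standard1";  // hypothetical column family name

            List<ColumnOrSuperColumn> row = client.get_slice(
                    "Keyspace1",   // hypothetical keyspace
                    "some-key",    // one of the ~3M row keys
                    parent,
                    predicate,
                    ConsistencyLevel.ONE);

            System.out.println("columns returned: " + row.size());
            socket.close();
        }
    }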

Re: Timeout Exception

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Nov 10, 2009 at 1:49 PM, Chris Were <ch...@gmail.com> wrote:
> Maybe... but it's not just multigets, it also happens when retreiving one
> row with get_slice.

how many of the 3M columns are you trying to slice at once?

Re: Timeout Exception

Posted by Chris Were <ch...@gmail.com>.
On Tue, Nov 10, 2009 at 11:34 AM, Jonathan Ellis <jb...@gmail.com> wrote:

> On Tue, Nov 10, 2009 at 1:23 PM, Chris Were <ch...@gmail.com> wrote:
> > It seems odd that the error sometimes says received 1 response, but it
> still
> > times out, as I only have one node.
>
> if it gets a response between when it decides to time out and when it
> logs the message, that could happen.  or does it happen frequently?
>

Approximately half the timeouts show they received 1 response but timed out.


 > As for load, CPU usage is certainly not a bottleneck.
>
> then maybe goffinet's parallelizing multigets to the local node
> (coming soon) will be all you need...
>

Maybe... but it's not just multigets, it also happens when retrieving one
row with get_slice.

Perhaps there is too much data for this one node, but tbh something just
doesn't feel right :)

Chris.

Re: Timeout Exception

Posted by Jonathan Ellis <jb...@gmail.com>.
On Tue, Nov 10, 2009 at 1:23 PM, Chris Were <ch...@gmail.com> wrote:
> It seems odd that the error sometimes says received 1 response, but it still
> times out, as I only have one node.

if it gets a response between when it decides to time out and when it
logs the message, that could happen.  or does it happen frequently?

> As for load, CPU usage is certainly not a bottleneck.

then maybe goffinet's parallelizing multigets to the local node
(coming soon) will be all you need...

-Jonathan

Re: Timeout Exception

Posted by Chris Were <ch...@gmail.com>.
There's no error on the source node other than the Timeout.
It appears to be occurring across multiple CF's (the majority of which are
normal columns).
I don't know an exact number, but some of the CF's would have ~3 million rows.
It seems odd that the error sometimes says received 1 response, but it still
times out, as I only have one node.
As for load, CPU usage is certainly not a bottleneck.
"top" consistently shows ~ 10-20% waiting,

Chris.

On Mon, Nov 9, 2009 at 9:22 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> What's causing the timeout?  An error on the source node, or just
> slowness?  If the latter, how many rows are in your multiget?
>
> On Mon, Nov 9, 2009 at 10:25 PM, Chris Were <ch...@gmail.com> wrote:
> >
> > I'm getting a Timeout Exception every now and again (currently every
> couple
> > of minutes or so).
> > Using revision 833288. Quorum set to ONE. My cassandra instance has been
> > running for two days and the data directory is around 16GB. I'm not sure
> > what the problem is, but let me know of any tests I can do to help reduce
> > the problem further. There are two variations on the exception, I have
> > pasted them both below.
> > ERROR [pool-1-thread-63] 2009-11-09 20:17:27,579 Cassandra.java (line
> > org.apache.cassandra.service.Cassandra$Processor) Internal error
> processing
> > get_slice
> > java.lang.RuntimeException: java.util.concurrent.TimeoutException:
> Operation
> > timed out - received only 0 responses from  .
> > at
> >
> org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:103)
> > at
> >
> org.apache.cassandra.service.CassandraServer.getSlice(CassandraServer.java:177)
> > at
> >
> org.apache.cassandra.service.CassandraServer.multigetSliceInternal(CassandraServer.java:252)
> > at
> >
> org.apache.cassandra.service.CassandraServer.get_slice(CassandraServer.java:215)
> > at
> >
> org.apache.cassandra.service.Cassandra$Processor$get_slice.process(Cassandra.java:668)
> > at
> >
> org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:624)
> > at
> >
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> > at java.lang.Thread.run(Thread.java:636)
> > Caused by: java.util.concurrent.TimeoutException: Operation timed out -
> > received only 0 responses from  .
> > at
> >
> org.apache.cassandra.service.QuorumResponseHandler.get(QuorumResponseHandler.java:79)
> > at
> >
> org.apache.cassandra.service.StorageProxy.strongRead(StorageProxy.java:408)
> > at
> >
> org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:333)
> > at
> >
> org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:95)
> > ... 9 more
> > ERROR [pool-1-thread-19] 2009-11-09 11:29:18,731 Cassandra.java (line
> > org.apache.cassandra.service.Cassandra$Processor) Internal error
> processing
> > get_slice
> > java.lang.RuntimeException: java.util.concurrent.TimeoutException:
> Operation
> > timed out - received only 1 responses from /10.121.217.5 .
> >         at
> >
> org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:103)
> >         at
> >
> org.apache.cassandra.service.CassandraServer.getSlice(CassandraServer.java:177)
> >         at
> >
> org.apache.cassandra.service.CassandraServer.multigetSliceInternal(CassandraServer.java:252)
> >         at
> >
> org.apache.cassandra.service.CassandraServer.get_slice(CassandraServer.java:215)
> >         at
> >
> org.apache.cassandra.service.Cassandra$Processor$get_slice.process(Cassandra.java:668)
> >         at
> >
> org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:624)
> >         at
> >
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
> >         at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> >         at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> >         at java.lang.Thread.run(Thread.java:636)
> > Caused by: java.util.concurrent.TimeoutException: Operation timed out -
> > received only 1 responses from /10.121.217.5 .
> >         at
> >
> org.apache.cassandra.service.QuorumResponseHandler.get(QuorumResponseHandler.java:79)
> >         at
> >
> org.apache.cassandra.service.StorageProxy.strongRead(StorageProxy.java:408)
> >         at
> >
> org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:333)
> >         at
> >
> org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:95)
> >         ... 9 more
> > Cheers,
> > Chris
>

Re: Timeout Exception

Posted by Jonathan Ellis <jb...@gmail.com>.
What's causing the timeout?  An error on the source node, or just
slowness?  If the latter, how many rows are in your multiget?

On Mon, Nov 9, 2009 at 10:25 PM, Chris Were <ch...@gmail.com> wrote:
>
> I'm getting a Timeout Exception every now and again (currently every couple
> of minutes or so).
> Using revision 833288. Quorum set to ONE. My cassandra instance has been
> running for two days and the data directory is around 16GB. I'm not sure
> what the problem is, but let me know of any tests I can do to help reduce
> the problem further. There are two variations on the exception, I have
> pasted them both below.
> ERROR [pool-1-thread-63] 2009-11-09 20:17:27,579 Cassandra.java (line
> org.apache.cassandra.service.Cassandra$Processor) Internal error processing
> get_slice
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Operation
> timed out - received only 0 responses from  .
> at
> org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:103)
> at
> org.apache.cassandra.service.CassandraServer.getSlice(CassandraServer.java:177)
> at
> org.apache.cassandra.service.CassandraServer.multigetSliceInternal(CassandraServer.java:252)
> at
> org.apache.cassandra.service.CassandraServer.get_slice(CassandraServer.java:215)
> at
> org.apache.cassandra.service.Cassandra$Processor$get_slice.process(Cassandra.java:668)
> at
> org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:624)
> at
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:636)
> Caused by: java.util.concurrent.TimeoutException: Operation timed out -
> received only 0 responses from  .
> at
> org.apache.cassandra.service.QuorumResponseHandler.get(QuorumResponseHandler.java:79)
> at
> org.apache.cassandra.service.StorageProxy.strongRead(StorageProxy.java:408)
> at
> org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:333)
> at
> org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:95)
> ... 9 more
> ERROR [pool-1-thread-19] 2009-11-09 11:29:18,731 Cassandra.java (line
> org.apache.cassandra.service.Cassandra$Processor) Internal error processing
> get_slice
> java.lang.RuntimeException: java.util.concurrent.TimeoutException: Operation
> timed out - received only 1 responses from /10.121.217.5 .
>         at
> org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:103)
>         at
> org.apache.cassandra.service.CassandraServer.getSlice(CassandraServer.java:177)
>         at
> org.apache.cassandra.service.CassandraServer.multigetSliceInternal(CassandraServer.java:252)
>         at
> org.apache.cassandra.service.CassandraServer.get_slice(CassandraServer.java:215)
>         at
> org.apache.cassandra.service.Cassandra$Processor$get_slice.process(Cassandra.java:668)
>         at
> org.apache.cassandra.service.Cassandra$Processor.process(Cassandra.java:624)
>         at
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:253)
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:636)
> Caused by: java.util.concurrent.TimeoutException: Operation timed out -
> received only 1 responses from /10.121.217.5 .
>         at
> org.apache.cassandra.service.QuorumResponseHandler.get(QuorumResponseHandler.java:79)
>         at
> org.apache.cassandra.service.StorageProxy.strongRead(StorageProxy.java:408)
>         at
> org.apache.cassandra.service.StorageProxy.readProtocol(StorageProxy.java:333)
>         at
> org.apache.cassandra.service.CassandraServer.readColumnFamily(CassandraServer.java:95)
>         ... 9 more
> Cheers,
> Chris

Re: Timeout Exception

Posted by Chris Were <ch...@gmail.com>.
I've only just gotten around to looking at this again... which config file value are
you referring to here?

Chris.

On Mon, Nov 9, 2009 at 9:09 PM, Igor Katkov <ik...@gmail.com> wrote:

> I experienced this - it might be the GC or the compaction process. Try
> increasing the timeout value in the config file. It's not a fix, but it's
> at least something...
>
>
> On Mon, Nov 9, 2009 at 11:25 PM, Chris Were <ch...@gmail.com> wrote:
>
>> [...]
>
>

Re: Timeout Exception

Posted by Igor Katkov <ik...@gmail.com>.
I experienced this - it might be the GC or the compaction process. Try
increasing the timeout value in the config file. It's not a fix, but it's at
least something...
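
If memory serves, the setting in question is RpcTimeoutInMillis in
storage-conf.xml -- it controls how long a node waits for replica responses
before giving up with the TimeoutException above. Raising it would look
roughly like the fragment below (the element name and the 5000 ms default are
from memory, so please double-check them against your own storage-conf.xml):

  <!-- storage-conf.xml (fragment): maximum time, in milliseconds, that a
       node waits for responses from other nodes before the request fails
       with a TimeoutException. The default is believed to be 5000. -->
  <RpcTimeoutInMillis>10000</RpcTimeoutInMillis>

If long GC pauses turn out to be the culprit, running the JVM with
-verbose:gc (or -Xloggc:<some-file>) and checking whether the pause times
line up with the timeouts is a quick way to confirm that before changing
anything else.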

On Mon, Nov 9, 2009 at 11:25 PM, Chris Were <ch...@gmail.com> wrote:

> [...]