Posted to user@cassandra.apache.org by Chap Lovejoy <ch...@gmail.com> on 2014/02/05 23:14:27 UTC

Periodic rpc_timeout errors on select query

Hi,

We're seeing pretty regular rpc timeout errors on what appear to be 
simple queries. We're running a three node cluster under pretty light 
load. We're averaging 30-40 writes/sec and about 8 reads/sec according 
to OpsCenter. The failures don't seem to be related to any changes in 
load. A single query repeated from CQLSH (about once a second or so) 
will fail approximately one out of ten times. I do see an increase in 
the average read latency around the time of the failure, though it's 
unclear if that's from the single failed request or if others are 
affected. This seems to happen most often on a number of similarly 
structured tables. One is:

CREATE TABLE psr (
   inst_id bigint,
   prosp_id bigint,
   inter_id bigint,
   avail text,
   comments text,
   email text,
   first_name text,
   last_name text,
   m_id text,
   m_num text,
   phone text,
   info blob,
   status text,
   time timestamp,
   PRIMARY KEY ((inst_id, prosp_id), inter_id)
) WITH CLUSTERING ORDER BY (inter_id DESC) AND
   bloom_filter_fp_chance=0.010000 AND
   caching='KEYS_ONLY' AND
   comment='' AND
   dclocal_read_repair_chance=0.000000 AND
   gc_grace_seconds=864000 AND
   index_interval=128 AND
   read_repair_chance=0.100000 AND
   replicate_on_write='true' AND
   populate_io_cache_on_flush='false' AND
   default_time_to_live=0 AND
   speculative_retry='99.0PERCENTILE' AND
   memtable_flush_period_in_ms=0 AND
   compaction={'class': 'SizeTieredCompactionStrategy'} AND
   compression={'sstable_compression': 'LZ4Compressor'};

I'm executing the query:
SELECT inter_id FROM "psr" WHERE inst_id = 1 AND prosp_id = 127788649174986752
  AND inter_id < 30273563814527279 LIMIT 10000;

Normally this query returns 32 rows. A total of 413 match the partition 
key.
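
For reference, a rough sketch of tracing the same query from application
code with the DataStax Java driver (2.0-style API) might look like the
following; the keyspace name "mykeyspace" and the contact point are
placeholders, not values from this cluster:

// Sketch only: run the reported query with tracing enabled and print the
// trace events the coordinator recorded.
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.QueryTrace;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public final class TracePsrQuery {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.128.32.141").build();
        Session session = cluster.connect("mykeyspace");   // placeholder keyspace

        SimpleStatement stmt = new SimpleStatement(
            "SELECT inter_id FROM psr WHERE inst_id = 1 AND prosp_id = 127788649174986752 "
          + "AND inter_id < 30273563814527279 LIMIT 10000");
        stmt.enableTracing();                               // ask the coordinator for a trace

        ResultSet rs = session.execute(stmt);
        QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
        for (QueryTrace.Event event : trace.getEvents()) {
            // each event carries the activity description and the node it ran on
            System.out.println(event.getDescription() + " on " + event.getSource());
        }
        cluster.shutdown();
    }
}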

Here is a trace for a successful run:
                                                   | timestamp    | source        | source_elapsed
--------------------------------------------------+--------------+---------------+----------------
                                execute_cql3_query | 21:52:20,831 | 10.128.32.141 |              0
              Message received from /10.128.32.141 | 21:52:20,826 | 10.128.32.140 |             69
           Executing single-partition query on psr | 21:52:20,827 | 10.128.32.140 |            502
                      Acquiring sstable references | 21:52:20,827 | 10.128.32.140 |            517
                       Merging memtable tombstones | 21:52:20,827 | 10.128.32.140 |            576
                      Key cache hit for sstable 54 | 21:52:20,827 | 10.128.32.140 |            685
 Seeking to partition indexed section in data file | 21:52:20,827 | 10.128.32.140 |            697
 Skipped 1/2 non-slice-intersecting sstables, included 0 due to tombstones | 21:52:20,827 | 10.128.32.140 |            751
         Merging data from memtables and 1 sstables | 21:52:20,827 | 10.128.32.140 |            773
                Read 32 live and 0 tombstoned cells | 21:52:20,828 | 10.128.32.140 |           2055
               Enqueuing response to /10.128.32.141 | 21:52:20,829 | 10.128.32.140 |           2172
                  Sending message to /10.128.32.141 | 21:52:20,829 | 10.128.32.140 |           2341
                                Parsing SELECT ...  | 21:52:20,831 | 10.128.32.141 |            105
                                Preparing statement | 21:52:20,831 | 10.128.32.141 |            200
                  Sending message to /10.128.32.140 | 21:52:20,831 | 10.128.32.141 |            492
               Message received from /10.128.32.140 | 21:52:20,836 | 10.128.32.141 |           5361
            Processing response from /10.128.32.140 | 21:52:20,836 | 10.128.32.141 |           5534
                                   Request complete | 21:52:20,837 | 10.128.32.141 |           6013


And here is an unsuccessful run:

                                                    | timestamp    | source        | source_elapsed
---------------------------------------------------+--------------+---------------+---------------
                                 execute_cql3_query | 21:56:19,792 | 10.128.32.141 |              0
                                Parsing SELECT ...  | 21:56:19,792 | 10.128.32.141 |             69
                                Preparing statement | 21:56:19,792 | 10.128.32.141 |            160
                  Sending message to /10.128.32.137 | 21:56:19,792 | 10.128.32.141 |            509
               Message received from /10.128.32.141 | 21:56:19,793 | 10.128.32.137 |             57
            Executing single-partition query on psr | 21:56:19,794 | 10.128.32.137 |            412
                       Acquiring sstable references | 21:56:19,794 | 10.128.32.137 |            444
                        Merging memtable tombstones | 21:56:19,794 | 10.128.32.137 |            486
                       Key cache hit for sstable 59 | 21:56:19,794 | 10.128.32.137 |            555
  Seeking to partition indexed section in data file | 21:56:19,794 | 10.128.32.137 |            569
                        Key cache hit for sstable 3 | 21:56:19,794 | 10.128.32.137 |            609
  Seeking to partition indexed section in data file | 21:56:19,794 | 10.128.32.137 |            621
               Timed out; received 0 of 1 responses | 21:56:24,792 | 10.128.32.141 |        5000766
                                   Request complete | 21:56:24,792 | 10.128.32.141 |        5000888

The output of cfstats is:

                 SSTable count: 3
                 Space used (live), bytes: 606268
                 Space used (total), bytes: 612372
                 SSTable Compression Ratio: 0.4011882905126778
                 Number of keys (estimate): 768
                 Memtable cell count: 156
                 Memtable data size, bytes: 60451
                 Memtable switch count: 34
                 Local read count: 1439
                 Local read latency: NaN ms
                 Local write count: 618
                 Local write latency: NaN ms
                 Pending tasks: 0
                 Bloom filter false positives: 0
                 Bloom filter false ratio: 0.00000
                 Bloom filter space used, bytes: 1840
                 Compacted partition minimum bytes: 447
                 Compacted partition maximum bytes: 20501
                 Compacted partition mean bytes: 3478
                 Average live cells per slice (last five minutes): 0.0
                 Average tombstones per slice (last five minutes): 0.0

Is there a problem with the table structure or query that would be 
causing these failures? Should we expect timeouts as a regular 
occurrence in operation and be prepared to retry every query as needed?
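
For illustration, the kind of blanket client-side retry being asked about
might look roughly like this with the DataStax Java driver; the helper
name executeWithRetry and the MAX_ATTEMPTS value are invented for the
sketch, not part of the driver:

// Illustrative helper only: retry a statement a few times when the
// coordinator reports a read timeout.
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.exceptions.ReadTimeoutException;

public final class RetryingReads {
    private static final int MAX_ATTEMPTS = 3;

    public static ResultSet executeWithRetry(Session session, Statement stmt) {
        ReadTimeoutException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return session.execute(stmt);   // normal case: the attempt succeeds
            } catch (ReadTimeoutException e) {
                last = e;                       // coordinator timed out waiting on a replica; retry
            }
        }
        throw last;                             // still failing after MAX_ATTEMPTS
    }
}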

I'd really appreciate any input you could offer on how to improve this. 
These timeouts are causing some rather troublesome errors in our 
application.

Thank you,
Chap

Re: Periodic rpc_timeout errors on select query

Posted by Chap Lovejoy <ch...@gmail.com>.
Hi Steve,

It looks like it will be pretty easy for us to do some testing with the 
new client version. I'm going to give it a shot and keep my fingers 
crossed.

Thanks again,
Chap

On 5 Feb 2014, at 18:10, Steven A Robenalt wrote:

> Hi Chap,
>
> If you have the ability to test the 2.0.0rc2 driver, I would recommend
> doing so, even from a dedicated test client or a JUnit test case. There are
> other benefits to the change, such as being able to use BatchStatements,
> aside from possible impact on your read timeouts.
>
> Steve

Re: Periodic rpc_timeout errors on select query

Posted by Steven A Robenalt <sr...@stanford.edu>.
Hi Chap,

If you have the ability to test the 2.0.0rc2 driver, I would recommend
doing so, even from a dedicated test client or a JUnit test case. There are
other benefits to the change, such as being able to use BatchStatements,
aside from possible impact on your read timeouts.
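
As a rough sketch only, a batch with the 2.0 driver looks something like
the following; the prepared INSERT and the bound values are made-up
examples against the psr table from earlier in the thread:

// Made-up example: group two inserts into one BatchStatement.
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public final class PsrBatchWrite {
    public static void writeTwoRows(Session session) {
        PreparedStatement insert = session.prepare(
            "INSERT INTO psr (inst_id, prosp_id, inter_id, status) VALUES (?, ?, ?, ?)");

        BatchStatement batch = new BatchStatement();   // logged batch by default
        batch.add(insert.bind(1L, 42L, 1001L, "new")); // placeholder values
        batch.add(insert.bind(1L, 42L, 1002L, "new"));
        session.execute(batch);                        // both inserts sent as one batch
    }
}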

Steve



On Wed, Feb 5, 2014 at 3:06 PM, Chap Lovejoy <ch...@gmail.com> wrote:

> Hi Steve,
>
> Thanks for the reply. After all that information in my initial message I
> would forget one of the most important bits. We're running Cassandra 2.0.3
> with the 1.0.4 version of the DataStax driver.  I'd seen mention of those
> timeouts under earlier 2.x versions and really hoped they were the source
> of our problem but unfortunately that doesn't seem to be the case.
>
> Thanks again,
> Chap
>
>
> On 5 Feb 2014, at 17:49, Steven A Robenalt wrote:
>
>> Hi Chap,
>>
>> You don't indicate which version of Cassandra and what client side driver
>> you are using, but I have seen the same behavior with Cassandra 2.0.2 and
>> earlier versions of the Java Driver. With Cassandra 2.0.3 and the 2.0.0rc2
>> driver, my read timeouts are basically nonexistent at my current load
>> levels.
>>
>> Not sure how this applies if you're still on 1.x versions of Cassandra
>> since we moved off of that branch a few months ago. Ditto for the client
>> driver if you're using something other than the Java Driver, or the 1.x
>> version of same. Our problems were due to changes specific to the 2.x
>> versions only.
>>
>> Steve
>>
>


-- 
Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063

srobenal@stanford.edu
http://highwire.stanford.edu

Re: Periodic rpc_timeout errors on select query

Posted by Chap Lovejoy <ch...@gmail.com>.
Hi Steve,

Thanks for the reply. After all that information in my initial message I 
would forget one of the most important bits. We're running Cassandra 
2.0.3 with the 1.0.4 version of the DataStax driver.  I'd seen mention 
of those timeouts under earlier 2.x versions and really hoped they were 
the source of our problem but unfortunately that doesn't seem to be the 
case.

Thanks again,
Chap

On 5 Feb 2014, at 17:49, Steven A Robenalt wrote:

> Hi Chap,
>
> You don't indicate which version of Cassandra and what client side driver
> you are using, but I have seen the same behavior with Cassandra 2.0.2 and
> earlier versions of the Java Driver. With Cassandra 2.0.3 and the 2.0.0rc2
> driver, my read timeouts are basically nonexistent at my current load
> levels.
>
> Not sure how this applies if you're still on 1.x versions of Cassandra
> since we moved off of that branch a few months ago. Ditto for the client
> driver if you're using something other than the Java Driver, or the 1.x
> version of same. Our problems were due to changes specific to the 2.x
> versions only.
>
> Steve

Re: Periodic rpc_timeout errors on select query

Posted by Steven A Robenalt <sr...@stanford.edu>.
Hi Chap,

You don't indicate which version of Cassandra and what client side driver
you are using, but I have seen the same behavior with Cassandra 2.0.2 and
earlier versions of the Java Driver. With Cassandra 2.0.3 and the 2.0.0rc2
driver, my read timeouts are basically nonexistent at my current load
levels.

Not sure how this applies if you're still on 1.x versions of Cassandra
since we moved off of that branch a few months ago. Ditto for the client
driver if you're using something other than the Java Driver, or the 1.x
version of same. Our problems were due to changes specific to the 2.x
versions only.

Steve

-- 
Steve Robenalt
Software Architect
HighWire | Stanford University
425 Broadway St, Redwood City, CA 94063

srobenal@stanford.edu
http://highwire.stanford.edu