Posted to user@cassandra.apache.org by Joseph Tech <ja...@gmail.com> on 2016/09/01 04:16:22 UTC

Re: Read timeouts on primary key queries

Patrick,

The desc table is below (only col names changed):

CREATE TABLE db.tbl (
    id1 text,
    id2 text,
    id3 text,
    id4 text,
    f1 text,
    f2 map<text, text>,
    f3 map<text, text>,
    created timestamp,
    updated timestamp,
    PRIMARY KEY (id1, id2, id3, id4)
) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'sstable_size_in_mb': '50', 'class':
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'sstable_compression':
'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.1
    AND speculative_retry = '99.0PERCENTILE';

and the query is select * from tbl where id1=? and id2=? and id3=? and id4=?
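
For reference, a minimal sketch of how such a read might be issued with the DataStax Java driver; the contact point, keyspace session and bound values below are placeholders, and only the statement shape and the LOCAL_QUORUM consistency level come from this thread:

    // import com.datastax.driver.core.*;
    Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build(); // placeholder contact point
    Session session = cluster.connect("db");
    PreparedStatement ps = session.prepare(
        "SELECT * FROM tbl WHERE id1=? AND id2=? AND id3=? AND id4=?");
    ps.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM); // matches the CL in the reported error
    Row row = session.execute(ps.bind("a", "b", "c", "d")).one(); // placeholder key values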

The timeouts happen within ~2s to ~5s, while the successful calls have an avg
of 8ms and a p99 of 15ms. These times are as seen from the app side; the actual
query times would be slightly lower.

Is there a way to capture traces only when queries take longer than a
specified duration? We can't enable tracing in production given the
volume of traffic. We see that the same query which timed out works fine
later, so we're not sure if the trace of a successful run would help.
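
One way to avoid tracing every request is probabilistic tracing, which is what ended up being used later in this thread; it samples a fraction of requests and writes the traces to the system_traces keyspace, e.g.:

    nodetool settraceprobability 0.001

DSE also exposes a duration-based slow query log via cql_slow_log_options in dse.yaml, mentioned further down in the thread.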

Thanks,
Joseph


On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pm...@gmail.com> wrote:

> If you are getting a timeout on one table, then a mismatch of RF and node
> count doesn't seem as likely.
>
> Time to look at your query. You said it was a 'select * from table where
> key=?' type query. I would next use the trace facility in cqlsh to
> investigate further. That's a good way to find hard to find issues. You
> should be looking for clear ledge where you go from single digit ms to 4 or
> 5 digit ms times.
>
> The other place to look is your data model for that table if you want to
> post the output from a desc table.
>
> Patrick
>
>
>
> On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <ja...@gmail.com>
> wrote:
>
>> On further analysis, this issue happens only on 1 table in the KS which
>> has the max reads.
>>
>> @Atul, I will look at system health, but didnt see anything standing out
>> from GC logs. (using JDK 1.8_92 with G1GC).
>>
>> @Patrick , could you please elaborate the "mismatch on node count + RF"
>> part.
>>
>> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <at...@snapdeal.com>
>> wrote:
>>
>>> There could be many reasons for this if it is intermittent. CPU usage +
>>> I/O wait status. As read are I/O intensive, your IOPS requirement should be
>>> met that time load. Heap issue if CPU is busy for GC only. Network health
>>> could be the reason. So better to look system health during that time when
>>> it comes.
>>>
>>> ------------------------------------------------------------
>>> ---------------------------------------------------------
>>> Atul Saroha
>>> *Lead Software Engineer*
>>> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
>>> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>>>  Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>>>
>>> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <ja...@gmail.com>
>>> wrote:
>>>
>>>> Hi Patrick,
>>>>
>>>> The nodetool status shows all nodes up and normal now. From OpsCenter
>>>> "Event Log" , there are some nodes reported as being down/up etc. during
>>>> the timeframe of timeout, but these are Search workload nodes from the
>>>> remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
>>>>
>>>> Thanks,
>>>> Joseph
>>>>
>>>> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pm...@gmail.com>
>>>> wrote:
>>>>
>>>>> You aren't achieving quorum on your reads as the error is explains.
>>>>> That means you either have some nodes down or your topology is not matching
>>>>> up. The fact you are using LOCAL_QUORUM might point to a datacenter
>>>>> mis-match on node count + RF.
>>>>>
>>>>> What does your nodetool status look like?
>>>>>
>>>>> Patrick
>>>>>
>>>>> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <ja...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We recently started getting intermittent timeouts on primary key
>>>>>> queries (select * from table where key=<key>)
>>>>>>
>>>>>> The error is : com.datastax.driver.core.exceptions.ReadTimeoutException:
>>>>>> Cassandra timeout during read query at consistency LOCAL_QUORUM (2
>>>>>> responses were required but only 1 replica
>>>>>> a responded)
>>>>>>
>>>>>> The same query would work fine when tried directly from cqlsh. There
>>>>>> are no indications in system.log for the table in question, though there
>>>>>> were compactions in progress for tables in another keyspace which is more
>>>>>> frequently accessed.
>>>>>>
>>>>>> My understanding is that the chances of primary key queries timing
>>>>>> out is very minimal. Please share the possible reasons / ways to debug this
>>>>>> issue.
>>>>>>
>>>>>> We are using Cassandra 2.1 (DSE 4.8.7).
>>>>>>
>>>>>> Thanks,
>>>>>> Joseph
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Read timeouts on primary key queries

Posted by Joseph Tech <ja...@gmail.com>.
Hi All,

Managed to capture the trace for a timed-out query using probabilistic
tracing (attached). It seems the timeout is caused by a DigestMismatchException
triggering a global read repair. I guess this is due
to dclocal_read_repair_chance = 0.0 for the table. Can someone please
confirm this, and whether dclocal_read_repair_chance = 0.1 would prevent it?
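
If this diagnosis is confirmed, the swap suggested earlier in the thread, i.e. preferring the DC-local read repair over the global one (per CASSANDRA-7320), would look like this for the redacted table:

    ALTER TABLE db.tbl
    WITH dclocal_read_repair_chance = 0.1
    AND read_repair_chance = 0.0;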

Thanks,
Joseph



On Thu, Sep 15, 2016 at 6:52 PM, Joseph Tech <ja...@gmail.com> wrote:

> I added the error logs and see that the timeouts are in a range b/n 2 to
> 7s. Samples below:
>
>  Query error after 5354 ms: [4 bound values] <query>
>  Query error after 6658 ms: [4 bound values] <query>
>  Query error after 4596 ms: [4 bound values] <query>
>  Query error after 2068 ms: [4 bound values] <query>
>  Query error after 2904 ms: [4 bound values] <query>
>
> There is no specific socket timeout set on the client side, so it would
> take the default of 12s. The read_request_timeout_in_ms is set to 5s. In
> this case, how do the errors happen in <5s ? . Is there any other factor
> that would cause a fail-fast scenario during the read?
>
> Thanks,
> Joseph
>
>
>
>
> On Wed, Sep 7, 2016 at 5:26 PM, Joseph Tech <ja...@gmail.com> wrote:
>
>> Thanks, Romain for the detailed explanation. We use log4j 2 and i have
>> added the driver logging for slow/error queries, will see if it helps to
>> provide any pattern once in Prod.
>>
>> I tried getendpoints and getsstables for some of the timed out keys and
>> most of them listed only 1 SSTable .There were a few which showed 2
>> SSTables. There is no specific trend on the keys, it's completely based on
>> the user access, and the same keys return results instantly from cqlsh
>>
>> On Tue, Sep 6, 2016 at 1:57 PM, Romain Hardouin <ro...@yahoo.fr>
>> wrote:
>>
>>> There is nothing special in the two sstablemetadata outuputs but if the
>>> timeouts are due to a network split or overwhelmed node or something like
>>> that you won't see anything here. That said, if you have the keys which
>>> produced the timeouts then, yes, you can look for a regular pattern (i.e.
>>> always the same keys?).
>>>
>>> You can find sstables for a given key with nodetool:
>>>     nodetool getendpoints <keyspace> <cf> <key>
>>> Then you can run the following command on one/each node of the enpoints:
>>>     nodetool getsstables <keyspace> <cf> <key>
>>>
>>> If many sstables are shown in the previous command it means that your
>>> data is fragmented but thanks to LCS this number should be low.
>>>
>>> I think the most usefull actions now would be:
>>>
>>> * 1) *Enable DEBUG for o.a.c.db.ConsistencyLevel, it won't spam your
>>> log file, you will see the following when errors will occur:
>>>     - Local replicas [<endpoint1>, ...] are insufficient to satisfy
>>> LOCAL_QUORUM requirement of X live nodes in '<dc>'
>>>
>>>     You are using C* 2.1 but you can have a look at the C* 2.2
>>> logback.xml: https://github.com/apache/cassandra/blob/cassan
>>> dra-2.2/conf/logback.xml
>>>     I'm using it on production, it's better because it creates a
>>> separate debug.log file with a asynchronous appender.
>>>
>>>    Watch out when enabling:
>>>
>>>     <appender-ref ref="ASYNCDEBUGLOG" />
>>>
>>>    Because the default logback configuration set all o.a.c in DEBUG:
>>>
>>>         <logger name="org.apache.cassandra" level="DEBUG"/>
>>>
>>>    Instead you can set:
>>>
>>>    <logger name="org.apache.cassandra" level="INFO"/>
>>>    <logger name="org.apache.cassandra.db.ConsistencyLevel"
>>> level="DEBUG"/>
>>>
>>>     Also, if you want to restrict debug.log to DEBUG level only (instead
>>> of DEBUG+INFO+...) you can add a LevelFilter to ASYNCDEBUGLOG in
>>> logback.xml:
>>>
>>>         <filter class="ch.qos.logback.classic.filter.LevelFilter">
>>>       <level>DEBUG</level>
>>>       <onMatch>ACCEPT</onMatch>
>>>       <onMismatch>DENY</onMismatch>
>>>     </filter>
>>>
>>>   Thus, the debug.log file will be empty unless some Consistency issues
>>> happen.
>>>
>>> * 2) *Enable slow queries log at the driver level with a QueryLogger:
>>>
>>>    Cluster cluster = ...
>>>    // log queries longer than 1 second, see also withDynamicThreshold
>>>    QueryLogger queryLogger = QueryLogger.builder(cluster).w
>>> ithConstantThreshold(1000).build();
>>>    cluster.register(queryLogger);
>>>
>>>     Then in your driver logback file:
>>>
>>>         <logger name="com.datastax.driver.core.QueryLogger.SLOW"
>>> level="DEBUG" />
>>>
>>>  *3) *And/or: you mentioned that you use DSE so you can enable slow
>>> queries logging in dse.yaml (cql_slow_log_options)
>>>
>>> Best,
>>>
>>> Romain
>>>
>>>
>>> Le Lundi 5 septembre 2016 20h05, Joseph Tech <ja...@gmail.com> a
>>> écrit :
>>>
>>>
>>> Attached are the sstablemeta outputs from 2 SSTables of size 28 MB and
>>> 52 MB (out2). The records are inserted with different TTLs based on their
>>> nature ; test records with 1 day, typeA records with 6 months, typeB
>>> records with 1 year etc. There are also explicit DELETEs from this table,
>>> though it's much lower than the rate of inserts.
>>>
>>> I am not sure how to interpret this output, or if it's the right
>>> SSTables that were picked. Please advise. Is there a way to get the
>>> sstables corresponding to the keys that timed out, though they are
>>> accessible later.
>>>
>>> On Mon, Sep 5, 2016 at 10:58 PM, Anshu Vajpayee <
>>> anshu.vajpayee@gmail.com> wrote:
>>>
>>> We have seen read time out issue in cassandra due to high droppable
>>> tombstone ratio for repository.
>>>
>>> Please check for high droppable tombstone ratio for your repo.
>>>
>>> On Mon, Sep 5, 2016 at 8:11 PM, Romain Hardouin <ro...@yahoo.fr>
>>> wrote:
>>>
>>> Yes dclocal_read_repair_chance will reduce the cross-DC traffic and
>>> latency, so you can swap the values ( https://issues.apache.org/ji
>>> ra/browse/CASSANDRA-7320
>>> <https://issues.apache.org/jira/browse/CASSANDRA-7320> ). I guess the
>>> sstable_size_in_mb was set to 50 because back in the day (C* 1.0) the
>>> default size was way too small: 5 MB. So maybe someone in your company
>>> tried "10 * the default" i.e. 50 MB. Now the default is 160 MB. I don't say
>>> to change the value but just keep in mind that you're using a small value
>>> here, it could help you someday.
>>>
>>> Regarding the cells, the histograms shows an *estimation* of the min,
>>> p50, ..., p99, max of cells based on SSTables metadata. On your screenshot,
>>> the Max is 4768. So you have a partition key with ~ 4768 cells. The p99 is
>>> 1109, so 99% of your partition keys have less than (or equal to) 1109
>>> cells.
>>> You can see these data of a given sstable with the tool sstablemetadata.
>>>
>>> Best,
>>>
>>> Romain
>>>
>>>
>>>
>>> Le Lundi 5 septembre 2016 15h17, Joseph Tech <ja...@gmail.com> a
>>> écrit :
>>>
>>>
>>> Thanks, Romain . We will try to enable the DEBUG logging (assuming it
>>> won't clog the logs much) . Regarding the table configs, read_repair_chance
>>> must be carried over from older versions - mostly defaults. I think sstable_size_in_mb
>>> was set to limit the max SSTable size, though i am not sure on the reason
>>> for the 50 MB value.
>>>
>>> Does setting dclocal_read_repair_chance help in reducing cross-DC
>>> traffic (haven't looked into this parameter, just going by the name).
>>>
>>> By the cell count definition : is it incremented based on the number of
>>> writes for a given name(key?) and value. This table is heavy on reads and
>>> writes. If so, the value should be much higher?
>>>
>>> On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <ro...@yahoo.fr>
>>> wrote:
>>>
>>> Hi,
>>>
>>> Try to put org.apache.cassandra.db. ConsistencyLevel at DEBUG level, it
>>> could help to find a regular pattern. By the way, I see that you have set a
>>> global read repair chance:
>>>     read_repair_chance = 0.1
>>> And not the local read repair:
>>>     dclocal_read_repair_chance = 0.0
>>> Is there any reason to do that or is it just the old (pre 2.0.9) default
>>> configuration?
>>>
>>> The cell count is the number of triplets: (name, value, timestamp)
>>>
>>> Also, I see that you have set sstable_size_in_mb at 50 MB. What is the
>>> rational behind this? (Yes I'm curious :-) ). Anyway your "SSTables per
>>> read" are good.
>>>
>>> Best,
>>>
>>> Romain
>>>
>>> Le Lundi 5 septembre 2016 13h32, Joseph Tech <ja...@gmail.com> a
>>> écrit :
>>>
>>>
>>> Hi Ryan,
>>>
>>> Attached are the cfhistograms run within few mins of each other. On the
>>> surface, don't see anything which indicates too much skewing (assuming
>>> skewing ==keys spread across many SSTables) . Please confirm. Related to
>>> this, what does the "cell count" metric indicate ; didn't find a clear
>>> explanation in the documents.
>>>
>>> Thanks,
>>> Joseph
>>>
>>>
>>> On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <rs...@foundev.pro> wrote:
>>>
>>> Have you looked at cfhistograms/tablehistograms your data maybe just
>>> skewed (most likely explanation is probably the correct one here)
>>>
>>> Regard,
>>>
>>> Ryan Svihla
>>>
>>> _____________________________
>>> From: Joseph Tech <ja...@gmail.com>
>>> Sent: Wednesday, August 31, 2016 11:16 PM
>>> Subject: Re: Read timeouts on primary key queries
>>> To: <us...@cassandra.apache.org>
>>>
>>>
>>>
>>> Patrick,
>>>
>>> The desc table is below (only col names changed) :
>>>
>>> CREATE TABLE db.tbl (
>>>     id1 text,
>>>     id2 text,
>>>     id3 text,
>>>     id4 text,
>>>     f1 text,
>>>     f2 map<text, text>,
>>>     f3 map<text, text>,
>>>     created timestamp,
>>>     updated timestamp,
>>>     PRIMARY KEY (id1, id2, id3, id4)
>>> ) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
>>>     AND bloom_filter_fp_chance = 0.01
>>>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>>>     AND comment = ''
>>>     AND compaction = {'sstable_size_in_mb': '50', 'class':
>>> 'org.apache.cassandra.db. compaction. LeveledCompactionStrategy'}
>>>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.
>>> compress.LZ4Compressor'}
>>>     AND dclocal_read_repair_chance = 0.0
>>>     AND default_time_to_live = 0
>>>     AND gc_grace_seconds = 864000
>>>     AND max_index_interval = 2048
>>>     AND memtable_flush_period_in_ms = 0
>>>     AND min_index_interval = 128
>>>     AND read_repair_chance = 0.1
>>>     AND speculative_retry = '99.0PERCENTILE';
>>>
>>> and the query is select * from tbl where id1=? and id2=? and id3=? and
>>> id4=?
>>>
>>> The timeouts happen within ~2s to ~5s, while the successful calls have
>>> avg of 8ms and p99 of 15s. These times are seen from app side, the actual
>>> query times would be slightly lower.
>>>
>>> Is there a way to capture traces only when queries take longer than a
>>> specified duration? . We can't enable tracing in production given the
>>> volume of traffic. We see that the same query which timed out works fine
>>> later, so not sure if the trace of a successful run would help.
>>>
>>> Thanks,
>>> Joseph
>>>
>>>
>>> On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pm...@gmail.com>
>>> wrote:
>>>
>>> If you are getting a timeout on one table, then a mismatch of RF and
>>> node count doesn't seem as likely.
>>>
>>> Time to look at your query. You said it was a 'select * from table where
>>> key=?' type query. I would next use the trace facility in cqlsh to
>>> investigate further. That's a good way to find hard to find issues. You
>>> should be looking for clear ledge where you go from single digit ms to 4 or
>>> 5 digit ms times.
>>>
>>> The other place to look is your data model for that table if you want to
>>> post the output from a desc table.
>>>
>>> Patrick
>>>
>>>
>>>
>>> On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <ja...@gmail.com>
>>> wrote:
>>>
>>> On further analysis, this issue happens only on 1 table in the KS which
>>> has the max reads.
>>>
>>> @Atul, I will look at system health, but didnt see anything standing out
>>> from GC logs. (using JDK 1.8_92 with G1GC).
>>>
>>> @Patrick , could you please elaborate the "mismatch on node count + RF"
>>> part.
>>>
>>> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <at...@snapdeal.com>
>>> wrote:
>>>
>>> There could be many reasons for this if it is intermittent. CPU usage +
>>> I/O wait status. As read are I/O intensive, your IOPS requirement should be
>>> met that time load. Heap issue if CPU is busy for GC only. Network health
>>> could be the reason. So better to look system health during that time when
>>> it comes.
>>>
>>> ------------------------------ ------------------------------
>>> ------------------------------ ---------------------------
>>> Atul Saroha
>>> *Lead Software Engineer*
>>> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
>>> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>>>  Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>>>
>>> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <ja...@gmail.com>
>>> wrote:
>>>
>>> Hi Patrick,
>>>
>>> The nodetool status shows all nodes up and normal now. From OpsCenter
>>> "Event Log" , there are some nodes reported as being down/up etc. during
>>> the timeframe of timeout, but these are Search workload nodes from the
>>> remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
>>>
>>> Thanks,
>>> Joseph
>>>
>>> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pm...@gmail.com>
>>> wrote:
>>>
>>> You aren't achieving quorum on your reads as the error is explains. That
>>> means you either have some nodes down or your topology is not matching up.
>>> The fact you are using LOCAL_QUORUM might point to a datacenter mis-match
>>> on node count + RF.
>>>
>>> What does your nodetool status look like?
>>>
>>> Patrick
>>>
>>> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <ja...@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> We recently started getting intermittent timeouts on primary key queries
>>> (select * from table where key=<key>)
>>>
>>> The error is : com.datastax.driver.core.excep
>>> tions.ReadTimeoutException: Cassandra timeout during read query at
>>> consistency LOCAL_QUORUM (2 responses were required but only 1 replica
>>> a responded)
>>>
>>> The same query would work fine when tried directly from cqlsh. There are
>>> no indications in system.log for the table in question, though there were
>>> compactions in progress for tables in another keyspace which is more
>>> frequently accessed.
>>>
>>> My understanding is that the chances of primary key queries timing out
>>> is very minimal. Please share the possible reasons / ways to debug this
>>> issue.
>>>
>>> We are using Cassandra 2.1 (DSE 4.8.7).
>>>
>>> Thanks,
>>> Joseph
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> *Regards,*
>>> *Anshu *
>>>
>>>
>>>
>>>
>>>
>>>
>>
>

Re: Read timeouts on primary key queries

Posted by Joseph Tech <ja...@gmail.com>.
I added the error logs and see that the timeouts are in a range between 2 and
7s. Samples below:

 Query error after 5354 ms: [4 bound values] <query>
 Query error after 6658 ms: [4 bound values] <query>
 Query error after 4596 ms: [4 bound values] <query>
 Query error after 2068 ms: [4 bound values] <query>
 Query error after 2904 ms: [4 bound values] <query>

There is no specific socket timeout set on the client side, so it would
take the default of 12s. The read_request_timeout_in_ms is set to 5s. In
this case, how do the errors happen in <5s? Is there any other factor
that would cause a fail-fast scenario during the read?
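
For reference, the client-side default referred to here is the Java driver's SocketOptions read timeout, which is 12000 ms unless overridden; a sketch of setting it explicitly, with the contact point as a placeholder:

    Cluster cluster = Cluster.builder()
        .addContactPoint("127.0.0.1") // placeholder
        .withSocketOptions(new SocketOptions().setReadTimeoutMillis(12000))
        .build();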

Thanks,
Joseph




On Wed, Sep 7, 2016 at 5:26 PM, Joseph Tech <ja...@gmail.com> wrote:

> Thanks, Romain for the detailed explanation. We use log4j 2 and i have
> added the driver logging for slow/error queries, will see if it helps to
> provide any pattern once in Prod.
>
> I tried getendpoints and getsstables for some of the timed out keys and
> most of them listed only 1 SSTable .There were a few which showed 2
> SSTables. There is no specific trend on the keys, it's completely based on
> the user access, and the same keys return results instantly from cqlsh
>
> On Tue, Sep 6, 2016 at 1:57 PM, Romain Hardouin <ro...@yahoo.fr>
> wrote:
>
>> There is nothing special in the two sstablemetadata outuputs but if the
>> timeouts are due to a network split or overwhelmed node or something like
>> that you won't see anything here. That said, if you have the keys which
>> produced the timeouts then, yes, you can look for a regular pattern (i.e.
>> always the same keys?).
>>
>> You can find sstables for a given key with nodetool:
>>     nodetool getendpoints <keyspace> <cf> <key>
>> Then you can run the following command on one/each node of the enpoints:
>>     nodetool getsstables <keyspace> <cf> <key>
>>
>> If many sstables are shown in the previous command it means that your
>> data is fragmented but thanks to LCS this number should be low.
>>
>> I think the most usefull actions now would be:
>>
>> * 1) *Enable DEBUG for o.a.c.db.ConsistencyLevel, it won't spam your log
>> file, you will see the following when errors will occur:
>>     - Local replicas [<endpoint1>, ...] are insufficient to satisfy
>> LOCAL_QUORUM requirement of X live nodes in '<dc>'
>>
>>     You are using C* 2.1 but you can have a look at the C* 2.2
>> logback.xml: https://github.com/apache/cassandra/blob/cassan
>> dra-2.2/conf/logback.xml
>>     I'm using it on production, it's better because it creates a separate
>> debug.log file with a asynchronous appender.
>>
>>    Watch out when enabling:
>>
>>     <appender-ref ref="ASYNCDEBUGLOG" />
>>
>>    Because the default logback configuration set all o.a.c in DEBUG:
>>
>>         <logger name="org.apache.cassandra" level="DEBUG"/>
>>
>>    Instead you can set:
>>
>>    <logger name="org.apache.cassandra" level="INFO"/>
>>    <logger name="org.apache.cassandra.db.ConsistencyLevel"
>> level="DEBUG"/>
>>
>>     Also, if you want to restrict debug.log to DEBUG level only (instead
>> of DEBUG+INFO+...) you can add a LevelFilter to ASYNCDEBUGLOG in
>> logback.xml:
>>
>>         <filter class="ch.qos.logback.classic.filter.LevelFilter">
>>       <level>DEBUG</level>
>>       <onMatch>ACCEPT</onMatch>
>>       <onMismatch>DENY</onMismatch>
>>     </filter>
>>
>>   Thus, the debug.log file will be empty unless some Consistency issues
>> happen.
>>
>> * 2) *Enable slow queries log at the driver level with a QueryLogger:
>>
>>    Cluster cluster = ...
>>    // log queries longer than 1 second, see also withDynamicThreshold
>>    QueryLogger queryLogger = QueryLogger.builder(cluster).w
>> ithConstantThreshold(1000).build();
>>    cluster.register(queryLogger);
>>
>>     Then in your driver logback file:
>>
>>         <logger name="com.datastax.driver.core.QueryLogger.SLOW"
>> level="DEBUG" />
>>
>>  *3) *And/or: you mentioned that you use DSE so you can enable slow
>> queries logging in dse.yaml (cql_slow_log_options)
>>
>> Best,
>>
>> Romain
>>
>>
>> Le Lundi 5 septembre 2016 20h05, Joseph Tech <ja...@gmail.com> a
>> écrit :
>>
>>
>> Attached are the sstablemeta outputs from 2 SSTables of size 28 MB and 52
>> MB (out2). The records are inserted with different TTLs based on their
>> nature ; test records with 1 day, typeA records with 6 months, typeB
>> records with 1 year etc. There are also explicit DELETEs from this table,
>> though it's much lower than the rate of inserts.
>>
>> I am not sure how to interpret this output, or if it's the right SSTables
>> that were picked. Please advise. Is there a way to get the sstables
>> corresponding to the keys that timed out, though they are accessible later.
>>
>> On Mon, Sep 5, 2016 at 10:58 PM, Anshu Vajpayee <anshu.vajpayee@gmail.com
>> > wrote:
>>
>> We have seen read time out issue in cassandra due to high droppable
>> tombstone ratio for repository.
>>
>> Please check for high droppable tombstone ratio for your repo.
>>
>> On Mon, Sep 5, 2016 at 8:11 PM, Romain Hardouin <ro...@yahoo.fr>
>> wrote:
>>
>> Yes dclocal_read_repair_chance will reduce the cross-DC traffic and
>> latency, so you can swap the values ( https://issues.apache.org/ji
>> ra/browse/CASSANDRA-7320
>> <https://issues.apache.org/jira/browse/CASSANDRA-7320> ). I guess the
>> sstable_size_in_mb was set to 50 because back in the day (C* 1.0) the
>> default size was way too small: 5 MB. So maybe someone in your company
>> tried "10 * the default" i.e. 50 MB. Now the default is 160 MB. I don't say
>> to change the value but just keep in mind that you're using a small value
>> here, it could help you someday.
>>
>> Regarding the cells, the histograms shows an *estimation* of the min,
>> p50, ..., p99, max of cells based on SSTables metadata. On your screenshot,
>> the Max is 4768. So you have a partition key with ~ 4768 cells. The p99 is
>> 1109, so 99% of your partition keys have less than (or equal to) 1109
>> cells.
>> You can see these data of a given sstable with the tool sstablemetadata.
>>
>> Best,
>>
>> Romain
>>
>>
>>
>> Le Lundi 5 septembre 2016 15h17, Joseph Tech <ja...@gmail.com> a
>> écrit :
>>
>>
>> Thanks, Romain . We will try to enable the DEBUG logging (assuming it
>> won't clog the logs much) . Regarding the table configs, read_repair_chance
>> must be carried over from older versions - mostly defaults. I think sstable_size_in_mb
>> was set to limit the max SSTable size, though i am not sure on the reason
>> for the 50 MB value.
>>
>> Does setting dclocal_read_repair_chance help in reducing cross-DC
>> traffic (haven't looked into this parameter, just going by the name).
>>
>> By the cell count definition : is it incremented based on the number of
>> writes for a given name(key?) and value. This table is heavy on reads and
>> writes. If so, the value should be much higher?
>>
>> On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <ro...@yahoo.fr>
>> wrote:
>>
>> Hi,
>>
>> Try to put org.apache.cassandra.db. ConsistencyLevel at DEBUG level, it
>> could help to find a regular pattern. By the way, I see that you have set a
>> global read repair chance:
>>     read_repair_chance = 0.1
>> And not the local read repair:
>>     dclocal_read_repair_chance = 0.0
>> Is there any reason to do that or is it just the old (pre 2.0.9) default
>> configuration?
>>
>> The cell count is the number of triplets: (name, value, timestamp)
>>
>> Also, I see that you have set sstable_size_in_mb at 50 MB. What is the
>> rational behind this? (Yes I'm curious :-) ). Anyway your "SSTables per
>> read" are good.
>>
>> Best,
>>
>> Romain
>>
>> Le Lundi 5 septembre 2016 13h32, Joseph Tech <ja...@gmail.com> a
>> écrit :
>>
>>
>> Hi Ryan,
>>
>> Attached are the cfhistograms run within few mins of each other. On the
>> surface, don't see anything which indicates too much skewing (assuming
>> skewing ==keys spread across many SSTables) . Please confirm. Related to
>> this, what does the "cell count" metric indicate ; didn't find a clear
>> explanation in the documents.
>>
>> Thanks,
>> Joseph
>>
>>
>> On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <rs...@foundev.pro> wrote:
>>
>> Have you looked at cfhistograms/tablehistograms your data maybe just
>> skewed (most likely explanation is probably the correct one here)
>>
>> Regard,
>>
>> Ryan Svihla
>>
>> _____________________________
>> From: Joseph Tech <ja...@gmail.com>
>> Sent: Wednesday, August 31, 2016 11:16 PM
>> Subject: Re: Read timeouts on primary key queries
>> To: <us...@cassandra.apache.org>
>>
>>
>>
>> Patrick,
>>
>> The desc table is below (only col names changed) :
>>
>> CREATE TABLE db.tbl (
>>     id1 text,
>>     id2 text,
>>     id3 text,
>>     id4 text,
>>     f1 text,
>>     f2 map<text, text>,
>>     f3 map<text, text>,
>>     created timestamp,
>>     updated timestamp,
>>     PRIMARY KEY (id1, id2, id3, id4)
>> ) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
>>     AND bloom_filter_fp_chance = 0.01
>>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>>     AND comment = ''
>>     AND compaction = {'sstable_size_in_mb': '50', 'class':
>> 'org.apache.cassandra.db. compaction. LeveledCompactionStrategy'}
>>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.
>> compress.LZ4Compressor'}
>>     AND dclocal_read_repair_chance = 0.0
>>     AND default_time_to_live = 0
>>     AND gc_grace_seconds = 864000
>>     AND max_index_interval = 2048
>>     AND memtable_flush_period_in_ms = 0
>>     AND min_index_interval = 128
>>     AND read_repair_chance = 0.1
>>     AND speculative_retry = '99.0PERCENTILE';
>>
>> and the query is select * from tbl where id1=? and id2=? and id3=? and
>> id4=?
>>
>> The timeouts happen within ~2s to ~5s, while the successful calls have
>> avg of 8ms and p99 of 15s. These times are seen from app side, the actual
>> query times would be slightly lower.
>>
>> Is there a way to capture traces only when queries take longer than a
>> specified duration? . We can't enable tracing in production given the
>> volume of traffic. We see that the same query which timed out works fine
>> later, so not sure if the trace of a successful run would help.
>>
>> Thanks,
>> Joseph
>>
>>
>> On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pm...@gmail.com>
>> wrote:
>>
>> If you are getting a timeout on one table, then a mismatch of RF and node
>> count doesn't seem as likely.
>>
>> Time to look at your query. You said it was a 'select * from table where
>> key=?' type query. I would next use the trace facility in cqlsh to
>> investigate further. That's a good way to find hard to find issues. You
>> should be looking for clear ledge where you go from single digit ms to 4 or
>> 5 digit ms times.
>>
>> The other place to look is your data model for that table if you want to
>> post the output from a desc table.
>>
>> Patrick
>>
>>
>>
>> On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <ja...@gmail.com>
>> wrote:
>>
>> On further analysis, this issue happens only on 1 table in the KS which
>> has the max reads.
>>
>> @Atul, I will look at system health, but didnt see anything standing out
>> from GC logs. (using JDK 1.8_92 with G1GC).
>>
>> @Patrick , could you please elaborate the "mismatch on node count + RF"
>> part.
>>
>> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <at...@snapdeal.com>
>> wrote:
>>
>> There could be many reasons for this if it is intermittent. CPU usage +
>> I/O wait status. As read are I/O intensive, your IOPS requirement should be
>> met that time load. Heap issue if CPU is busy for GC only. Network health
>> could be the reason. So better to look system health during that time when
>> it comes.
>>
>> ------------------------------ ------------------------------
>> ------------------------------ ---------------------------
>> Atul Saroha
>> *Lead Software Engineer*
>> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
>> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>>  Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>>
>> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <ja...@gmail.com>
>> wrote:
>>
>> Hi Patrick,
>>
>> The nodetool status shows all nodes up and normal now. From OpsCenter
>> "Event Log" , there are some nodes reported as being down/up etc. during
>> the timeframe of timeout, but these are Search workload nodes from the
>> remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
>>
>> Thanks,
>> Joseph
>>
>> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pm...@gmail.com>
>> wrote:
>>
>> You aren't achieving quorum on your reads as the error is explains. That
>> means you either have some nodes down or your topology is not matching up.
>> The fact you are using LOCAL_QUORUM might point to a datacenter mis-match
>> on node count + RF.
>>
>> What does your nodetool status look like?
>>
>> Patrick
>>
>> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <ja...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> We recently started getting intermittent timeouts on primary key queries
>> (select * from table where key=<key>)
>>
>> The error is : com.datastax.driver.core.excep tions.ReadTimeoutException:
>> Cassandra timeout during read query at consistency LOCAL_QUORUM (2
>> responses were required but only 1 replica
>> a responded)
>>
>> The same query would work fine when tried directly from cqlsh. There are
>> no indications in system.log for the table in question, though there were
>> compactions in progress for tables in another keyspace which is more
>> frequently accessed.
>>
>> My understanding is that the chances of primary key queries timing out is
>> very minimal. Please share the possible reasons / ways to debug this issue.
>>
>> We are using Cassandra 2.1 (DSE 4.8.7).
>>
>> Thanks,
>> Joseph
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>> *Regards,*
>> *Anshu *
>>
>>
>>
>>
>>
>>
>

Re: Read timeouts on primary key queries

Posted by Romain Hardouin <ro...@yahoo.fr>.
 Is it still fast if you specify CONSISTENCY LOCAL_QUORUM in cqlsh?
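
e.g., using the redacted names from the schema earlier in the thread and placeholder key values:

    cqlsh> CONSISTENCY LOCAL_QUORUM;
    cqlsh> SELECT * FROM db.tbl WHERE id1='...' AND id2='...' AND id3='...' AND id4='...';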
Romain

On Wednesday, 7 September 2016 at 13:56, Joseph Tech <ja...@gmail.com> wrote:
 Thanks, Romain for the detailed explanation. We use log4j 2 and i have added the driver logging for slow/error queries, will see if it helps to provide any pattern once in Prod. 
I tried getendpoints and getsstables for some of the timed out keys and most of them listed only 1 SSTable .There were a few which showed 2 SSTables. There is no specific trend on the keys, it's completely based on the user access, and the same keys return results instantly from cqlsh


On Tue, Sep 6, 2016 at 1:57 PM, Romain Hardouin <ro...@yahoo.fr> wrote:

There is nothing special in the two sstablemetadata outuputs but if the timeouts are due to a network split or overwhelmed node or something like that you won't see anything here. That said, if you have the keys which produced the timeouts then, yes, you can look for a regular pattern (i.e. always the same keys?). 

You can find sstables for a given key with nodetool:
    nodetool getendpoints <keyspace> <cf> <key>
Then you can run the following command on one/each node of the enpoints:
    nodetool getsstables <keyspace> <cf> <key>

If many sstables are shown in the previous command it means that your data is fragmented but thanks to LCS this number should be low.

I think the most usefull actions now would be:

 1) Enable DEBUG for o.a.c.db.ConsistencyLevel, it won't spam your log file, you will see the following when errors will occur:
    - Local replicas [<endpoint1>, ...] are insufficient to satisfy LOCAL_QUORUM requirement of X live nodes in '<dc>'

    You are using C* 2.1 but you can have a look at the C* 2.2 logback.xml: https://github.com/apache/cassandra/blob/cassandra-2.2/conf/logback.xml
    I'm using it on production, it's better because it creates a separate debug.log file with a asynchronous appender.

   Watch out when enabling:

    <appender-ref ref="ASYNCDEBUGLOG" />

   Because the default logback configuration set all o.a.c in DEBUG:

        <logger name="org.apache.cassandra" level="DEBUG"/>

   Instead you can set:

   <logger name="org.apache.cassandra" level="INFO"/>
   <logger name="org.apache.cassandra.db.ConsistencyLevel" level="DEBUG"/>

    Also, if you want to restrict debug.log to DEBUG level only (instead of DEBUG+INFO+...) you can add a LevelFilter to ASYNCDEBUGLOG in logback.xml:

        <filter class="ch.qos.logback.classic.filter.LevelFilter">
      <level>DEBUG</level>
      <onMatch>ACCEPT</onMatch>
      <onMismatch>DENY</onMismatch>
    </filter>

  Thus, the debug.log file will be empty unless some Consistency issues happen.

 2) Enable slow queries log at the driver level with a QueryLogger:

   Cluster cluster = ...
   // log queries longer than 1 second, see also withDynamicThreshold
   QueryLogger queryLogger = QueryLogger.builder(cluster).withConstantThreshold(1000).build();
   cluster.register(queryLogger);

    Then in your driver logback file:

        <logger name="com.datastax.driver.core.QueryLogger.SLOW" level="DEBUG" />

 3) And/or: you mentioned that you use DSE so you can enable slow queries logging in dse.yaml (cql_slow_log_options)
Best,
Romain   

 Le Lundi 5 septembre 2016 20h05, Joseph Tech <ja...@gmail.com> a écrit :
 

 Attached are the sstablemeta outputs from 2 SSTables of size 28 MB and 52 MB (out2). The records are inserted with different TTLs based on their nature ; test records with 1 day, typeA records with 6 months, typeB records with 1 year etc. There are also explicit DELETEs from this table, though it's much lower than the rate of inserts.
I am not sure how to interpret this output, or if it's the right SSTables that were picked. Please advise. Is there a way to get the sstables corresponding to the keys that timed out, though they are accessible later.
On Mon, Sep 5, 2016 at 10:58 PM, Anshu Vajpayee <an...@gmail.com> wrote:

We have seen read time out issue in cassandra due to high droppable tombstone ratio for repository. 
Please check for high droppable tombstone ratio for your repo. 
On Mon, Sep 5, 2016 at 8:11 PM, Romain Hardouin <ro...@yahoo.fr> wrote:

Yes dclocal_read_repair_chance will reduce the cross-DC traffic and latency, so you can swap the values ( https://issues.apache.org/jira/browse/CASSANDRA-7320 ). I guess the sstable_size_in_mb was set to 50 because back in the day (C* 1.0) the default size was way too small: 5 MB. So maybe someone in your company tried "10 * the default" i.e. 50 MB. Now the default is 160 MB. I don't say to change the value but just keep in mind that you're using a small value here, it could help you someday.
Regarding the cells, the histograms shows an *estimation* of the min, p50, ..., p99, max of cells based on SSTables metadata. On your screenshot, the Max is 4768. So you have a partition key with ~ 4768 cells. The p99 is 1109, so 99% of your partition keys have less than (or equal to) 1109 cells. You can see these data of a given sstable with the tool sstablemetadata.
Best,
Romain
 

    Le Lundi 5 septembre 2016 15h17, Joseph Tech <ja...@gmail.com> a écrit :
 

 Thanks, Romain . We will try to enable the DEBUG logging (assuming it won't clog the logs much) . Regarding the table configs, read_repair_chance must be carried over from older versions - mostly defaults. I think sstable_size_in_mb was set to limit the max SSTable size, though i am not sure on the reason for the 50 MB value.
Does setting dclocal_read_repair_chance help in reducing cross-DC traffic (haven't looked into this parameter, just going by the name).

By the cell count definition : is it incremented based on the number of writes for a given name(key?) and value. This table is heavy on reads and writes. If so, the value should be much higher?
On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <ro...@yahoo.fr> wrote:

Hi,
Try to put org.apache.cassandra.db.ConsistencyLevel at DEBUG level, it could help to find a regular pattern. By the way, I see that you have set a global read repair chance:
    read_repair_chance = 0.1
And not the local read repair:
    dclocal_read_repair_chance = 0.0
Is there any reason to do that or is it just the old (pre 2.0.9) default configuration?
The cell count is the number of triplets: (name, value, timestamp)
Also, I see that you have set sstable_size_in_mb at 50 MB. What is the rational behind this? (Yes I'm curious :-) ). Anyway your "SSTables per read" are good.
Best,
Romain
    Le Lundi 5 septembre 2016 13h32, Joseph Tech <ja...@gmail.com> a écrit :
 

 Hi Ryan,
Attached are the cfhistograms run within few mins of each other. On the surface, don't see anything which indicates too much skewing (assuming skewing ==keys spread across many SSTables) . Please confirm. Related to this, what does the "cell count" metric indicate ; didn't find a clear explanation in the documents.
Thanks,Joseph

On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <rs...@foundev.pro> wrote:

 Have you looked at cfhistograms/tablehistograms your data maybe just skewed (most likely explanation is probably the correct one here)

Regard,
Ryan Svihla
 _____________________________
From: Joseph Tech <ja...@gmail.com>
Sent: Wednesday, August 31, 2016 11:16 PM
Subject: Re: Read timeouts on primary key queries
To: <us...@cassandra.apache.org>


Patrick,
The desc table is below (only col names changed) : 
CREATE TABLE db.tbl (
    id1 text,
    id2 text,
    id3 text,
    id4 text,
    f1 text,
    f2 map<text, text>,
    f3 map<text, text>,
    created timestamp,
    updated timestamp,
    PRIMARY KEY (id1, id2, id3, id4)
) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'sstable_size_in_mb': '50', 'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.1
    AND speculative_retry = '99.0PERCENTILE';
and the query is select * from tbl where id1=? and id2=? and id3=? and id4=?
The timeouts happen within ~2s to ~5s, while the successful calls have avg of 8ms and p99 of 15s. These times are seen from app side, the actual query times would be slightly lower. 
Is there a way to capture traces only when queries take longer than a specified duration? . We can't enable tracing in production given the volume of traffic. We see that the same query which timed out works fine later, so not sure if the trace of a successful run would help.
Thanks,Joseph

On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pm...@gmail.com> wrote:

If you are getting a timeout on one table, then a mismatch of RF and node count doesn't seem as likely. 
Time to look at your query. You said it was a 'select * from table where key=?' type query. I would next use the trace facility in cqlsh to investigate further. That's a good way to find hard to find issues. You should be looking for clear ledge where you go from single digit ms to 4 or 5 digit ms times. 
The other place to look is your data model for that table if you want to post the output from a desc table.
Patrick


On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <ja...@gmail.com> wrote:

On further analysis, this issue happens only on 1 table in the KS which has the max reads. 
@Atul, I will look at system health, but didnt see anything standing out from GC logs. (using JDK 1.8_92 with G1GC). 
@Patrick , could you please elaborate the "mismatch on node count + RF" part.
On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <at...@snapdeal.com> wrote:

There could be many reasons for this if it is intermittent. CPU usage + I/O wait status. As read are I/O intensive, your IOPS requirement should be met that time load. Heap issue if CPU is busy for GC only. Network health could be the reason. So better to look system health during that time when it comes.

------------------------------ ------------------------------ ------------------------------ ---------------------------
Atul Saroha
Lead Software Engineer
M: +91 8447784271 T: +91 124-415-6069 EXT: 12369
Plot # 362, ASF Centre - Tower A, Udyog Vihar,
 Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <ja...@gmail.com> wrote:

Hi Patrick,
The nodetool status shows all nodes up and normal now. From OpsCenter "Event Log" , there are some nodes reported as being down/up etc. during the timeframe of timeout, but these are Search workload nodes from the remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
Thanks,Joseph
On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pm...@gmail.com> wrote:

You aren't achieving quorum on your reads as the error is explains. That means you either have some nodes down or your topology is not matching up. The fact you are using LOCAL_QUORUM might point to a datacenter mis-match on node count + RF. 
What does your nodetool status look like?
Patrick
On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <ja...@gmail.com> wrote:

Hi,
We recently started getting intermittent timeouts on primary key queries (select * from table where key=<key>)
The error is : com.datastax.driver.core.excep tions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 1 replica
a responded)
The same query would work fine when tried directly from cqlsh. There are no indications in system.log for the table in question, though there were compactions in progress for tables in another keyspace which is more frequently accessed. 
My understanding is that the chances of primary key queries timing out is very minimal. Please share the possible reasons / ways to debug this issue. 

We are using Cassandra 2.1 (DSE 4.8.7).
Thanks,Joseph
















 



   



   



--
Regards,
Anshu





   



   

Re: Read timeouts on primary key queries

Posted by Joseph Tech <ja...@gmail.com>.
Thanks, Romain, for the detailed explanation. We use log4j 2 and I have
added the driver logging for slow/error queries; will see if it helps to
provide any pattern once in Prod.

I tried getendpoints and getsstables for some of the timed out keys and
most of them listed only 1 SSTable. There were a few which showed 2
SSTables. There is no specific trend on the keys; it's completely based on
the user access, and the same keys return results instantly from cqlsh.
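
For concreteness, with the redacted keyspace/table names used earlier in the thread and a placeholder key, the commands would be of the form:

    nodetool getendpoints db tbl <key>
    nodetool getsstables db tbl <key>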

On Tue, Sep 6, 2016 at 1:57 PM, Romain Hardouin <ro...@yahoo.fr> wrote:

> There is nothing special in the two sstablemetadata outuputs but if the
> timeouts are due to a network split or overwhelmed node or something like
> that you won't see anything here. That said, if you have the keys which
> produced the timeouts then, yes, you can look for a regular pattern (i.e.
> always the same keys?).
>
> You can find sstables for a given key with nodetool:
>     nodetool getendpoints <keyspace> <cf> <key>
> Then you can run the following command on one/each node of the enpoints:
>     nodetool getsstables <keyspace> <cf> <key>
>
> If many sstables are shown in the previous command it means that your data
> is fragmented but thanks to LCS this number should be low.
>
> I think the most usefull actions now would be:
>
> * 1) *Enable DEBUG for o.a.c.db.ConsistencyLevel, it won't spam your log
> file, you will see the following when errors will occur:
>     - Local replicas [<endpoint1>, ...] are insufficient to satisfy
> LOCAL_QUORUM requirement of X live nodes in '<dc>'
>
>     You are using C* 2.1 but you can have a look at the C* 2.2
> logback.xml: https://github.com/apache/cassandra/blob/
> cassandra-2.2/conf/logback.xml
>     I'm using it on production, it's better because it creates a separate
> debug.log file with a asynchronous appender.
>
>    Watch out when enabling:
>
>     <appender-ref ref="ASYNCDEBUGLOG" />
>
>    Because the default logback configuration set all o.a.c in DEBUG:
>
>         <logger name="org.apache.cassandra" level="DEBUG"/>
>
>    Instead you can set:
>
>    <logger name="org.apache.cassandra" level="INFO"/>
>    <logger name="org.apache.cassandra.db.ConsistencyLevel" level="DEBUG"/>
>
>     Also, if you want to restrict debug.log to DEBUG level only (instead
> of DEBUG+INFO+...) you can add a LevelFilter to ASYNCDEBUGLOG in
> logback.xml:
>
>         <filter class="ch.qos.logback.classic.filter.LevelFilter">
>       <level>DEBUG</level>
>       <onMatch>ACCEPT</onMatch>
>       <onMismatch>DENY</onMismatch>
>     </filter>
>
>   Thus, the debug.log file will be empty unless some Consistency issues
> happen.
>
> * 2) *Enable slow queries log at the driver level with a QueryLogger:
>
>    Cluster cluster = ...
>    // log queries longer than 1 second, see also withDynamicThreshold
>    QueryLogger queryLogger = QueryLogger.builder(cluster).
> withConstantThreshold(1000).build();
>    cluster.register(queryLogger);
>
>     Then in your driver logback file:
>
>         <logger name="com.datastax.driver.core.QueryLogger.SLOW"
> level="DEBUG" />
>
>  *3) *And/or: you mentioned that you use DSE so you can enable slow
> queries logging in dse.yaml (cql_slow_log_options)
>
> Best,
>
> Romain
>
>
> Le Lundi 5 septembre 2016 20h05, Joseph Tech <ja...@gmail.com> a
> écrit :
>
>
> Attached are the sstablemeta outputs from 2 SSTables of size 28 MB and 52
> MB (out2). The records are inserted with different TTLs based on their
> nature ; test records with 1 day, typeA records with 6 months, typeB
> records with 1 year etc. There are also explicit DELETEs from this table,
> though it's much lower than the rate of inserts.
>
> I am not sure how to interpret this output, or if it's the right SSTables
> that were picked. Please advise. Is there a way to get the sstables
> corresponding to the keys that timed out, though they are accessible later.
>
> On Mon, Sep 5, 2016 at 10:58 PM, Anshu Vajpayee <an...@gmail.com>
> wrote:
>
> We have seen read time out issue in cassandra due to high droppable
> tombstone ratio for repository.
>
> Please check for high droppable tombstone ratio for your repo.
>
> On Mon, Sep 5, 2016 at 8:11 PM, Romain Hardouin <ro...@yahoo.fr>
> wrote:
>
> Yes dclocal_read_repair_chance will reduce the cross-DC traffic and
> latency, so you can swap the values ( https://issues.apache.org/ji
> ra/browse/CASSANDRA-7320
> <https://issues.apache.org/jira/browse/CASSANDRA-7320> ). I guess the
> sstable_size_in_mb was set to 50 because back in the day (C* 1.0) the
> default size was way too small: 5 MB. So maybe someone in your company
> tried "10 * the default" i.e. 50 MB. Now the default is 160 MB. I don't say
> to change the value but just keep in mind that you're using a small value
> here, it could help you someday.
>
> Regarding the cells, the histograms shows an *estimation* of the min, p50,
> ..., p99, max of cells based on SSTables metadata. On your screenshot, the
> Max is 4768. So you have a partition key with ~ 4768 cells. The p99 is
> 1109, so 99% of your partition keys have less than (or equal to) 1109
> cells.
> You can see these data of a given sstable with the tool sstablemetadata.
>
> Best,
>
> Romain
>
>
>
> Le Lundi 5 septembre 2016 15h17, Joseph Tech <ja...@gmail.com> a
> écrit :
>
>
> Thanks, Romain . We will try to enable the DEBUG logging (assuming it
> won't clog the logs much) . Regarding the table configs, read_repair_chance
> must be carried over from older versions - mostly defaults. I think sstable_size_in_mb
> was set to limit the max SSTable size, though i am not sure on the reason
> for the 50 MB value.
>
> Does setting dclocal_read_repair_chance help in reducing cross-DC traffic
> (haven't looked into this parameter, just going by the name).
>
> By the cell count definition : is it incremented based on the number of
> writes for a given name(key?) and value. This table is heavy on reads and
> writes. If so, the value should be much higher?
>
> On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <ro...@yahoo.fr>
> wrote:
>
> Hi,
>
> Try to put org.apache.cassandra.db. ConsistencyLevel at DEBUG level, it
> could help to find a regular pattern. By the way, I see that you have set a
> global read repair chance:
>     read_repair_chance = 0.1
> And not the local read repair:
>     dclocal_read_repair_chance = 0.0
> Is there any reason to do that or is it just the old (pre 2.0.9) default
> configuration?
>
> The cell count is the number of triplets: (name, value, timestamp)
>
> Also, I see that you have set sstable_size_in_mb at 50 MB. What is the
> rational behind this? (Yes I'm curious :-) ). Anyway your "SSTables per
> read" are good.
>
> Best,
>
> Romain
>
> Le Lundi 5 septembre 2016 13h32, Joseph Tech <ja...@gmail.com> a
> écrit :
>
>
> Hi Ryan,
>
> Attached are the cfhistograms run within few mins of each other. On the
> surface, don't see anything which indicates too much skewing (assuming
> skewing ==keys spread across many SSTables) . Please confirm. Related to
> this, what does the "cell count" metric indicate ; didn't find a clear
> explanation in the documents.
>
> Thanks,
> Joseph
>
>
> On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <rs...@foundev.pro> wrote:
>
> Have you looked at cfhistograms/tablehistograms your data maybe just
> skewed (most likely explanation is probably the correct one here)
>
> Regard,
>
> Ryan Svihla
>
> _____________________________
> From: Joseph Tech <ja...@gmail.com>
> Sent: Wednesday, August 31, 2016 11:16 PM
> Subject: Re: Read timeouts on primary key queries
> To: <us...@cassandra.apache.org>
>
>
>
> Patrick,
>
> The desc table is below (only col names changed) :
>
> CREATE TABLE db.tbl (
>     id1 text,
>     id2 text,
>     id3 text,
>     id4 text,
>     f1 text,
>     f2 map<text, text>,
>     f3 map<text, text>,
>     created timestamp,
>     updated timestamp,
>     PRIMARY KEY (id1, id2, id3, id4)
> ) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = ''
>     AND compaction = {'sstable_size_in_mb': '50', 'class':
> 'org.apache.cassandra.db. compaction. LeveledCompactionStrategy'}
>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.
> compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.0
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.1
>     AND speculative_retry = '99.0PERCENTILE';
>
> and the query is select * from tbl where id1=? and id2=? and id3=? and
> id4=?
>
> The timeouts happen within ~2s to ~5s, while the successful calls have avg
> of 8ms and p99 of 15s. These times are seen from app side, the actual query
> times would be slightly lower.
>
> Is there a way to capture traces only when queries take longer than a
> specified duration? . We can't enable tracing in production given the
> volume of traffic. We see that the same query which timed out works fine
> later, so not sure if the trace of a successful run would help.
>
> Thanks,
> Joseph
>
>
> On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pm...@gmail.com>
> wrote:
>
> If you are getting a timeout on one table, then a mismatch of RF and node
> count doesn't seem as likely.
>
> Time to look at your query. You said it was a 'select * from table where
> key=?' type query. I would next use the trace facility in cqlsh to
> investigate further. That's a good way to find hard to find issues. You
> should be looking for clear ledge where you go from single digit ms to 4 or
> 5 digit ms times.
>
> The other place to look is your data model for that table if you want to
> post the output from a desc table.
>
> Patrick
>
>
>
> On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <ja...@gmail.com>
> wrote:
>
> On further analysis, this issue happens only on 1 table in the KS which
> has the max reads.
>
> @Atul, I will look at system health, but didnt see anything standing out
> from GC logs. (using JDK 1.8_92 with G1GC).
>
> @Patrick , could you please elaborate the "mismatch on node count + RF"
> part.
>
> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <at...@snapdeal.com>
> wrote:
>
> There could be many reasons for this if it is intermittent. CPU usage +
> I/O wait status. As read are I/O intensive, your IOPS requirement should be
> met that time load. Heap issue if CPU is busy for GC only. Network health
> could be the reason. So better to look system health during that time when
> it comes.
>
> ------------------------------ ------------------------------
> ------------------------------ ---------------------------
> Atul Saroha
> *Lead Software Engineer*
> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>  Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>
> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <ja...@gmail.com>
> wrote:
>
> Hi Patrick,
>
> The nodetool status shows all nodes up and normal now. From OpsCenter
> "Event Log" , there are some nodes reported as being down/up etc. during
> the timeframe of timeout, but these are Search workload nodes from the
> remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
>
> Thanks,
> Joseph
>
> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pm...@gmail.com>
> wrote:
>
> You aren't achieving quorum on your reads as the error is explains. That
> means you either have some nodes down or your topology is not matching up.
> The fact you are using LOCAL_QUORUM might point to a datacenter mis-match
> on node count + RF.
>
> What does your nodetool status look like?
>
> Patrick
>
> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <ja...@gmail.com>
> wrote:
>
> Hi,
>
> We recently started getting intermittent timeouts on primary key queries
> (select * from table where key=<key>)
>
> The error is: com.datastax.driver.core.exceptions.ReadTimeoutException:
> Cassandra timeout during read query at consistency LOCAL_QUORUM (2
> responses were required but only 1 replica responded)
>
> The same query would work fine when tried directly from cqlsh. There are
> no indications in system.log for the table in question, though there were
> compactions in progress for tables in another keyspace which is more
> frequently accessed.
>
> My understanding is that the chances of primary key queries timing out is
> very minimal. Please share the possible reasons / ways to debug this issue.
>
> We are using Cassandra 2.1 (DSE 4.8.7).
>
> Thanks,
> Joseph
>
> --
> *Regards,*
> *Anshu *
>
>
>
>
>
>

Re: Read timeouts on primary key queries

Posted by Romain Hardouin <ro...@yahoo.fr>.
There is nothing special in the two sstablemetadata outputs, but if the timeouts are due to a network split, an overwhelmed node or something like that, you won't see anything here. That said, if you have the keys which produced the timeouts then, yes, you can look for a regular pattern (i.e. always the same keys?).

You can find the SSTables for a given key with nodetool. First find the replicas:
    nodetool getendpoints <keyspace> <cf> <key>
Then you can run the following command on one/each node of the endpoints:
    nodetool getsstables <keyspace> <cf> <key>
If many sstables are shown by the previous command it means that your data is fragmented, but thanks to LCS this number should be low.
I think the most useful actions now would be:
 1) Enable DEBUG for o.a.c.db.ConsistencyLevel. It won't spam your log file; you will see the following when errors occur:
    - Local replicas [<endpoint1>, ...] are insufficient to satisfy LOCAL_QUORUM requirement of X live nodes in '<dc>'
    You are using C* 2.1 but you can have a look at the C* 2.2 logback.xml: https://github.com/apache/cassandra/blob/cassandra-2.2/conf/logback.xml
    I'm using it in production; it's better because it creates a separate debug.log file with an asynchronous appender.
    Watch out when enabling:
        <appender-ref ref="ASYNCDEBUGLOG" />
    because the default logback configuration sets all of o.a.c to DEBUG:
        <logger name="org.apache.cassandra" level="DEBUG"/>
    Instead you can set:
        <logger name="org.apache.cassandra" level="INFO"/>
        <logger name="org.apache.cassandra.db.ConsistencyLevel" level="DEBUG"/>
    Also, if you want to restrict debug.log to DEBUG level only (instead of DEBUG+INFO+...) you can add a LevelFilter to ASYNCDEBUGLOG in logback.xml:
        <filter class="ch.qos.logback.classic.filter.LevelFilter">
          <level>DEBUG</level>
          <onMatch>ACCEPT</onMatch>
          <onMismatch>DENY</onMismatch>
        </filter>
    Thus, the debug.log file will be empty unless some consistency issues happen.
 2) Enable a slow queries log at the driver level with a QueryLogger (a fuller sketch follows below):
        Cluster cluster = ...
        // log queries longer than 1 second, see also withDynamicThreshold
        QueryLogger queryLogger = QueryLogger.builder(cluster).withConstantThreshold(1000).build();
        cluster.register(queryLogger);
    Then in your driver logback file:
        <logger name="com.datastax.driver.core.QueryLogger.SLOW" level="DEBUG" />
 3) And/or: you mentioned that you use DSE, so you can enable slow queries logging in dse.yaml (cql_slow_log_options).
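To make option 2 concrete, here is a minimal, self-contained sketch of the slow-query logging setup based on the QueryLogger calls quoted above (DataStax Java driver 2.x API); the contact point, class name and 1000 ms threshold are illustrative placeholders, not values taken from this thread:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.QueryLogger;
    import com.datastax.driver.core.Session;

    public class SlowQueryLoggingExample {
        public static void main(String[] args) {
            // Placeholder contact point; use the application's real seed nodes.
            Cluster cluster = Cluster.builder()
                    .addContactPoint("127.0.0.1")
                    .build();

            // Queries slower than 1000 ms are logged at DEBUG under the
            // com.datastax.driver.core.QueryLogger.SLOW logger (see also
            // withDynamicThreshold for a percentile-based threshold).
            QueryLogger queryLogger = QueryLogger.builder(cluster)
                    .withConstantThreshold(1000)
                    .build();
            cluster.register(queryLogger);

            Session session = cluster.connect();
            // ... run the application's reads as usual; only the slow ones
            // will show up in the driver's log ...
            session.close();
            cluster.close();
        }
    }

Combined with the QueryLogger.SLOW logger entry in the driver's logback file, this captures only the statements that exceed the threshold, which addresses the earlier question about tracing only the queries that take longer than a given duration.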
Best,
Romain   

 On Monday, 5 September 2016 at 20:05, Joseph Tech <ja...@gmail.com> wrote:
 

 Attached are the sstablemeta outputs from 2 SSTables of size 28 MB and 52 MB (out2). The records are inserted with different TTLs based on their nature ; test records with 1 day, typeA records with 6 months, typeB records with 1 year etc. There are also explicit DELETEs from this table, though it's much lower than the rate of inserts.
I am not sure how to interpret this output, or if it's the right SSTables that were picked. Please advise. Is there a way to get the sstables corresponding to the keys that timed out, though they are accessible later.
On Mon, Sep 5, 2016 at 10:58 PM, Anshu Vajpayee <an...@gmail.com> wrote:

We have seen read time out issue in cassandra due to high droppable tombstone ratio for repository. 
Please check for high droppable tombstone ratio for your repo. 
On Mon, Sep 5, 2016 at 8:11 PM, Romain Hardouin <ro...@yahoo.fr> wrote:

Yes dclocal_read_repair_chance will reduce the cross-DC traffic and latency, so you can swap the values ( https://issues.apache.org/ji ra/browse/CASSANDRA-7320 ). I guess the sstable_size_in_mb was set to 50 because back in the day (C* 1.0) the default size was way too small: 5 MB. So maybe someone in your company tried "10 * the default" i.e. 50 MB. Now the default is 160 MB. I don't say to change the value but just keep in mind that you're using a small value here, it could help you someday.
Regarding the cells, the histograms shows an *estimation* of the min, p50, ..., p99, max of cells based on SSTables metadata. On your screenshot, the Max is 4768. So you have a partition key with ~ 4768 cells. The p99 is 1109, so 99% of your partition keys have less than (or equal to) 1109 cells. You can see these data of a given sstable with the tool sstablemetadata.
Best,
Romain
 

    On Monday, 5 September 2016 at 15:17, Joseph Tech <ja...@gmail.com> wrote:
 

 Thanks, Romain . We will try to enable the DEBUG logging (assuming it won't clog the logs much) . Regarding the table configs, read_repair_chance must be carried over from older versions - mostly defaults. I think sstable_size_in_mb was set to limit the max SSTable size, though i am not sure on the reason for the 50 MB value.
Does setting dclocal_read_repair_chance help in reducing cross-DC traffic (haven't looked into this parameter, just going by the name).

By the cell count definition : is it incremented based on the number of writes for a given name(key?) and value. This table is heavy on reads and writes. If so, the value should be much higher?
On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <ro...@yahoo.fr> wrote:

Hi,
Try to put org.apache.cassandra.db. ConsistencyLevel at DEBUG level, it could help to find a regular pattern. By the way, I see that you have set a global read repair chance:    read_repair_chance = 0.1And not the local read repair:    dclocal_read_repair_chance = 0.0 Is there any reason to do that or is it just the old (pre 2.0.9) default configuration? 
The cell count is the number of triplets: (name, value, timestamp)
Also, I see that you have set sstable_size_in_mb at 50 MB. What is the rational behind this? (Yes I'm curious :-) ). Anyway your "SSTables per read" are good.
Best,
Romain
    On Monday, 5 September 2016 at 13:32, Joseph Tech <ja...@gmail.com> wrote:
 

 Hi Ryan,
Attached are the cfhistograms run within few mins of each other. On the surface, don't see anything which indicates too much skewing (assuming skewing ==keys spread across many SSTables) . Please confirm. Related to this, what does the "cell count" metric indicate ; didn't find a clear explanation in the documents.
Thanks,Joseph

On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <rs...@foundev.pro> wrote:

 Have you looked at cfhistograms/tablehistograms your data maybe just skewed (most likely explanation is probably the correct one here)

Regard,
Ryan Svihla
 _____________________________
From: Joseph Tech <ja...@gmail.com>
Sent: Wednesday, August 31, 2016 11:16 PM
Subject: Re: Read timeouts on primary key queries
To: <us...@cassandra.apache.org>


Patrick,
The desc table is below (only col names changed) : 
CREATE TABLE db.tbl (    id1 text,    id2 text,    id3 text,    id4 text,    f1 text,    f2 map<text, text>,    f3 map<text, text>,    created timestamp,    updated timestamp,    PRIMARY KEY (id1, id2, id3, id4)) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)    AND bloom_filter_fp_chance = 0.01    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'    AND comment = ''    AND compaction = {'sstable_size_in_mb': '50', 'class': 'org.apache.cassandra.db. compaction. LeveledCompactionStrategy'}    AND compression = {'sstable_compression': 'org.apache.cassandra.io. compress.LZ4Compressor'}    AND dclocal_read_repair_chance = 0.0    AND default_time_to_live = 0    AND gc_grace_seconds = 864000    AND max_index_interval = 2048    AND memtable_flush_period_in_ms = 0    AND min_index_interval = 128    AND read_repair_chance = 0.1    AND speculative_retry = '99.0PERCENTILE';
and the query is select * from tbl where id1=? and id2=? and id3=? and id4=?
The timeouts happen within ~2s to ~5s, while the successful calls have avg of 8ms and p99 of 15s. These times are seen from app side, the actual query times would be slightly lower. 
Is there a way to capture traces only when queries take longer than a specified duration? . We can't enable tracing in production given the volume of traffic. We see that the same query which timed out works fine later, so not sure if the trace of a successful run would help.
Thanks,Joseph

On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pm...@gmail.com> wrote:

If you are getting a timeout on one table, then a mismatch of RF and node count doesn't seem as likely. 
Time to look at your query. You said it was a 'select * from table where key=?' type query. I would next use the trace facility in cqlsh to investigate further. That's a good way to find hard to find issues. You should be looking for clear ledge where you go from single digit ms to 4 or 5 digit ms times. 
The other place to look is your data model for that table if you want to post the output from a desc table.
Patrick


On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <ja...@gmail.com> wrote:

On further analysis, this issue happens only on 1 table in the KS which has the max reads. 
@Atul, I will look at system health, but didnt see anything standing out from GC logs. (using JDK 1.8_92 with G1GC). 
@Patrick , could you please elaborate the "mismatch on node count + RF" part.
On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <at...@snapdeal.com> wrote:

There could be many reasons for this if it is intermittent. CPU usage + I/O wait status. As read are I/O intensive, your IOPS requirement should be met that time load. Heap issue if CPU is busy for GC only. Network health could be the reason. So better to look system health during that time when it comes.

------------------------------ ------------------------------ ------------------------------ ---------------------------
Atul Saroha
Lead Software Engineer
M: +91 8447784271 T: +91 124-415-6069 EXT: 12369
Plot # 362, ASF Centre - Tower A, Udyog Vihar,
 Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <ja...@gmail.com> wrote:

Hi Patrick,
The nodetool status shows all nodes up and normal now. From OpsCenter "Event Log" , there are some nodes reported as being down/up etc. during the timeframe of timeout, but these are Search workload nodes from the remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
Thanks,Joseph
On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pm...@gmail.com> wrote:

You aren't achieving quorum on your reads as the error is explains. That means you either have some nodes down or your topology is not matching up. The fact you are using LOCAL_QUORUM might point to a datacenter mis-match on node count + RF. 
What does your nodetool status look like?
Patrick
On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <ja...@gmail.com> wrote:

Hi,
We recently started getting intermittent timeouts on primary key queries (select * from table where key=<key>)
The error is : com.datastax.driver.core.excep tions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 1 replica
a responded)
The same query would work fine when tried directly from cqlsh. There are no indications in system.log for the table in question, though there were compactions in progress for tables in another keyspace which is more frequently accessed. 
My understanding is that the chances of primary key queries timing out is very minimal. Please share the possible reasons / ways to debug this issue. 

We are using Cassandra 2.1 (DSE 4.8.7).
Thanks,Joseph

-- 
Regards,Anshu 





   

Re: Read timeouts on primary key queries

Posted by Joseph Tech <ja...@gmail.com>.
Attached are the sstablemetadata outputs from 2 SSTables of size 28 MB and 52
MB (out2). The records are inserted with different TTLs based on their
nature ; test records with 1 day, typeA records with 6 months, typeB
records with 1 year etc. There are also explicit DELETEs from this table,
though it's much lower than the rate of inserts.

I am not sure how to interpret this output, or if it's the right SSTables
that were picked. Please advise. Is there a way to get the sstables
corresponding to the keys that timed out, even though they are accessible later?

On Mon, Sep 5, 2016 at 10:58 PM, Anshu Vajpayee <an...@gmail.com>
wrote:

> We have seen read time out issue in cassandra due to high droppable
> tombstone ratio for repository.
>
> Please check for high droppable tombstone ratio for your repo.
>
> On Mon, Sep 5, 2016 at 8:11 PM, Romain Hardouin <ro...@yahoo.fr>
> wrote:
>
>> Yes dclocal_read_repair_chance will reduce the cross-DC traffic and
>> latency, so you can swap the values ( https://issues.apache.org/ji
>> ra/browse/CASSANDRA-7320 ). I guess the sstable_size_in_mb was set to 50
>> because back in the day (C* 1.0) the default size was way too small: 5 MB.
>> So maybe someone in your company tried "10 * the default" i.e. 50 MB. Now
>> the default is 160 MB. I don't say to change the value but just keep in
>> mind that you're using a small value here, it could help you someday.
>>
>> Regarding the cells, the histograms shows an *estimation* of the min,
>> p50, ..., p99, max of cells based on SSTables metadata. On your screenshot,
>> the Max is 4768. So you have a partition key with ~ 4768 cells. The p99 is
>> 1109, so 99% of your partition keys have less than (or equal to) 1109
>> cells.
>> You can see these data of a given sstable with the tool sstablemetadata.
>>
>> Best,
>>
>> Romain
>>
>>
>>
>> On Monday, 5 September 2016 at 15:17, Joseph Tech <ja...@gmail.com>
>> wrote:
>>
>>
>> Thanks, Romain . We will try to enable the DEBUG logging (assuming it
>> won't clog the logs much) . Regarding the table configs, read_repair_chance
>> must be carried over from older versions - mostly defaults. I think sstable_size_in_mb
>> was set to limit the max SSTable size, though i am not sure on the reason
>> for the 50 MB value.
>>
>> Does setting dclocal_read_repair_chance help in reducing cross-DC
>> traffic (haven't looked into this parameter, just going by the name).
>>
>> By the cell count definition : is it incremented based on the number of
>> writes for a given name(key?) and value. This table is heavy on reads and
>> writes. If so, the value should be much higher?
>>
>> On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <ro...@yahoo.fr>
>> wrote:
>>
>> Hi,
>>
>> Try to put org.apache.cassandra.db.ConsistencyLevel at DEBUG level, it
>> could help to find a regular pattern. By the way, I see that you have set a
>> global read repair chance:
>>     read_repair_chance = 0.1
>> And not the local read repair:
>>     dclocal_read_repair_chance = 0.0
>> Is there any reason to do that or is it just the old (pre 2.0.9) default
>> configuration?
>>
>> The cell count is the number of triplets: (name, value, timestamp)
>>
>> Also, I see that you have set sstable_size_in_mb at 50 MB. What is the
>> rational behind this? (Yes I'm curious :-) ). Anyway your "SSTables per
>> read" are good.
>>
>> Best,
>>
>> Romain
>>
>> On Monday, 5 September 2016 at 13:32, Joseph Tech <ja...@gmail.com>
>> wrote:
>>
>>
>> Hi Ryan,
>>
>> Attached are the cfhistograms run within few mins of each other. On the
>> surface, don't see anything which indicates too much skewing (assuming
>> skewing ==keys spread across many SSTables) . Please confirm. Related to
>> this, what does the "cell count" metric indicate ; didn't find a clear
>> explanation in the documents.
>>
>> Thanks,
>> Joseph
>>
>>
>> On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <rs...@foundev.pro> wrote:
>>
>> Have you looked at cfhistograms/tablehistograms your data maybe just
>> skewed (most likely explanation is probably the correct one here)
>>
>> Regard,
>>
>> Ryan Svihla
>>
>> _____________________________
>> From: Joseph Tech <ja...@gmail.com>
>> Sent: Wednesday, August 31, 2016 11:16 PM
>> Subject: Re: Read timeouts on primary key queries
>> To: <us...@cassandra.apache.org>
>>
>>
>>
>> Patrick,
>>
>> The desc table is below (only col names changed) :
>>
>> CREATE TABLE db.tbl (
>>     id1 text,
>>     id2 text,
>>     id3 text,
>>     id4 text,
>>     f1 text,
>>     f2 map<text, text>,
>>     f3 map<text, text>,
>>     created timestamp,
>>     updated timestamp,
>>     PRIMARY KEY (id1, id2, id3, id4)
>> ) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
>>     AND bloom_filter_fp_chance = 0.01
>>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>>     AND comment = ''
>>     AND compaction = {'sstable_size_in_mb': '50', 'class':
>> 'org.apache.cassandra.db. compaction. LeveledCompactionStrategy'}
>>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.
>> compress.LZ4Compressor'}
>>     AND dclocal_read_repair_chance = 0.0
>>     AND default_time_to_live = 0
>>     AND gc_grace_seconds = 864000
>>     AND max_index_interval = 2048
>>     AND memtable_flush_period_in_ms = 0
>>     AND min_index_interval = 128
>>     AND read_repair_chance = 0.1
>>     AND speculative_retry = '99.0PERCENTILE';
>>
>> and the query is select * from tbl where id1=? and id2=? and id3=? and
>> id4=?
>>
>> The timeouts happen within ~2s to ~5s, while the successful calls have
>> avg of 8ms and p99 of 15s. These times are seen from app side, the actual
>> query times would be slightly lower.
>>
>> Is there a way to capture traces only when queries take longer than a
>> specified duration? . We can't enable tracing in production given the
>> volume of traffic. We see that the same query which timed out works fine
>> later, so not sure if the trace of a successful run would help.
>>
>> Thanks,
>> Joseph
>>
>>
>> On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pm...@gmail.com>
>> wrote:
>>
>> If you are getting a timeout on one table, then a mismatch of RF and node
>> count doesn't seem as likely.
>>
>> Time to look at your query. You said it was a 'select * from table where
>> key=?' type query. I would next use the trace facility in cqlsh to
>> investigate further. That's a good way to find hard to find issues. You
>> should be looking for clear ledge where you go from single digit ms to 4 or
>> 5 digit ms times.
>>
>> The other place to look is your data model for that table if you want to
>> post the output from a desc table.
>>
>> Patrick
>>
>>
>>
>> On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <ja...@gmail.com>
>> wrote:
>>
>> On further analysis, this issue happens only on 1 table in the KS which
>> has the max reads.
>>
>> @Atul, I will look at system health, but didnt see anything standing out
>> from GC logs. (using JDK 1.8_92 with G1GC).
>>
>> @Patrick , could you please elaborate the "mismatch on node count + RF"
>> part.
>>
>> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <at...@snapdeal.com>
>> wrote:
>>
>> There could be many reasons for this if it is intermittent. CPU usage +
>> I/O wait status. As read are I/O intensive, your IOPS requirement should be
>> met that time load. Heap issue if CPU is busy for GC only. Network health
>> could be the reason. So better to look system health during that time when
>> it comes.
>>
>> ------------------------------ ------------------------------
>> ------------------------------ ---------------------------
>> Atul Saroha
>> *Lead Software Engineer*
>> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
>> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>>  Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>>
>> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <ja...@gmail.com>
>> wrote:
>>
>> Hi Patrick,
>>
>> The nodetool status shows all nodes up and normal now. From OpsCenter
>> "Event Log" , there are some nodes reported as being down/up etc. during
>> the timeframe of timeout, but these are Search workload nodes from the
>> remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
>>
>> Thanks,
>> Joseph
>>
>> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pm...@gmail.com>
>> wrote:
>>
>> You aren't achieving quorum on your reads as the error is explains. That
>> means you either have some nodes down or your topology is not matching up.
>> The fact you are using LOCAL_QUORUM might point to a datacenter mis-match
>> on node count + RF.
>>
>> What does your nodetool status look like?
>>
>> Patrick
>>
>> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <ja...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> We recently started getting intermittent timeouts on primary key queries
>> (select * from table where key=<key>)
>>
>> The error is : com.datastax.driver.core.excep tions.ReadTimeoutException:
>> Cassandra timeout during read query at consistency LOCAL_QUORUM (2
>> responses were required but only 1 replica
>> a responded)
>>
>> The same query would work fine when tried directly from cqlsh. There are
>> no indications in system.log for the table in question, though there were
>> compactions in progress for tables in another keyspace which is more
>> frequently accessed.
>>
>> My understanding is that the chances of primary key queries timing out is
>> very minimal. Please share the possible reasons / ways to debug this issue.
>>
>> We are using Cassandra 2.1 (DSE 4.8.7).
>>
>> Thanks,
>> Joseph
>>
>
>
> --
> *Regards,*
> *Anshu *
>
>
>

Re: Read timeouts on primary key queries

Posted by Anshu Vajpayee <an...@gmail.com>.
We have seen read timeout issues in Cassandra due to a high droppable
tombstone ratio for a repository.

Please check for a high droppable tombstone ratio for your repo.

On Mon, Sep 5, 2016 at 8:11 PM, Romain Hardouin <ro...@yahoo.fr> wrote:

> Yes dclocal_read_repair_chance will reduce the cross-DC traffic and
> latency, so you can swap the values ( https://issues.apache.org/
> jira/browse/CASSANDRA-7320 ). I guess the sstable_size_in_mb was set to
> 50 because back in the day (C* 1.0) the default size was way too small: 5
> MB. So maybe someone in your company tried "10 * the default" i.e. 50 MB.
> Now the default is 160 MB. I don't say to change the value but just keep in
> mind that you're using a small value here, it could help you someday.
>
> Regarding the cells, the histograms shows an *estimation* of the min, p50,
> ..., p99, max of cells based on SSTables metadata. On your screenshot, the
> Max is 4768. So you have a partition key with ~ 4768 cells. The p99 is
> 1109, so 99% of your partition keys have less than (or equal to) 1109
> cells.
> You can see these data of a given sstable with the tool sstablemetadata.
>
> Best,
>
> Romain
>
>
>
> On Monday, 5 September 2016 at 15:17, Joseph Tech <ja...@gmail.com>
> wrote:
>
>
> Thanks, Romain . We will try to enable the DEBUG logging (assuming it
> won't clog the logs much) . Regarding the table configs, read_repair_chance
> must be carried over from older versions - mostly defaults. I think sstable_size_in_mb
> was set to limit the max SSTable size, though i am not sure on the reason
> for the 50 MB value.
>
> Does setting dclocal_read_repair_chance help in reducing cross-DC traffic
> (haven't looked into this parameter, just going by the name).
>
> By the cell count definition : is it incremented based on the number of
> writes for a given name(key?) and value. This table is heavy on reads and
> writes. If so, the value should be much higher?
>
> On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <ro...@yahoo.fr>
> wrote:
>
> Hi,
>
> Try to put org.apache.cassandra.db.ConsistencyLevel at DEBUG level, it
> could help to find a regular pattern. By the way, I see that you have set a
> global read repair chance:
>     read_repair_chance = 0.1
> And not the local read repair:
>     dclocal_read_repair_chance = 0.0
> Is there any reason to do that or is it just the old (pre 2.0.9) default
> configuration?
>
> The cell count is the number of triplets: (name, value, timestamp)
>
> Also, I see that you have set sstable_size_in_mb at 50 MB. What is the
> rational behind this? (Yes I'm curious :-) ). Anyway your "SSTables per
> read" are good.
>
> Best,
>
> Romain
>
> On Monday, 5 September 2016 at 13:32, Joseph Tech <ja...@gmail.com>
> wrote:
>
>
> Hi Ryan,
>
> Attached are the cfhistograms run within few mins of each other. On the
> surface, don't see anything which indicates too much skewing (assuming
> skewing ==keys spread across many SSTables) . Please confirm. Related to
> this, what does the "cell count" metric indicate ; didn't find a clear
> explanation in the documents.
>
> Thanks,
> Joseph
>
>
> On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <rs...@foundev.pro> wrote:
>
> Have you looked at cfhistograms/tablehistograms your data maybe just
> skewed (most likely explanation is probably the correct one here)
>
> Regard,
>
> Ryan Svihla
>
> _____________________________
> From: Joseph Tech <ja...@gmail.com>
> Sent: Wednesday, August 31, 2016 11:16 PM
> Subject: Re: Read timeouts on primary key queries
> To: <us...@cassandra.apache.org>
>
>
>
> Patrick,
>
> The desc table is below (only col names changed) :
>
> CREATE TABLE db.tbl (
>     id1 text,
>     id2 text,
>     id3 text,
>     id4 text,
>     f1 text,
>     f2 map<text, text>,
>     f3 map<text, text>,
>     created timestamp,
>     updated timestamp,
>     PRIMARY KEY (id1, id2, id3, id4)
> ) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = ''
>     AND compaction = {'sstable_size_in_mb': '50', 'class':
> 'org.apache.cassandra.db. compaction. LeveledCompactionStrategy'}
>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.
> compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.0
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.1
>     AND speculative_retry = '99.0PERCENTILE';
>
> and the query is select * from tbl where id1=? and id2=? and id3=? and
> id4=?
>
> The timeouts happen within ~2s to ~5s, while the successful calls have avg
> of 8ms and p99 of 15s. These times are seen from app side, the actual query
> times would be slightly lower.
>
> Is there a way to capture traces only when queries take longer than a
> specified duration? . We can't enable tracing in production given the
> volume of traffic. We see that the same query which timed out works fine
> later, so not sure if the trace of a successful run would help.
>
> Thanks,
> Joseph
>
>
> On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pm...@gmail.com>
> wrote:
>
> If you are getting a timeout on one table, then a mismatch of RF and node
> count doesn't seem as likely.
>
> Time to look at your query. You said it was a 'select * from table where
> key=?' type query. I would next use the trace facility in cqlsh to
> investigate further. That's a good way to find hard to find issues. You
> should be looking for clear ledge where you go from single digit ms to 4 or
> 5 digit ms times.
>
> The other place to look is your data model for that table if you want to
> post the output from a desc table.
>
> Patrick
>
>
>
> On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <ja...@gmail.com>
> wrote:
>
> On further analysis, this issue happens only on 1 table in the KS which
> has the max reads.
>
> @Atul, I will look at system health, but didnt see anything standing out
> from GC logs. (using JDK 1.8_92 with G1GC).
>
> @Patrick , could you please elaborate the "mismatch on node count + RF"
> part.
>
> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <at...@snapdeal.com>
> wrote:
>
> There could be many reasons for this if it is intermittent. CPU usage +
> I/O wait status. As read are I/O intensive, your IOPS requirement should be
> met that time load. Heap issue if CPU is busy for GC only. Network health
> could be the reason. So better to look system health during that time when
> it comes.
>
> ------------------------------ ------------------------------
> ------------------------------ ---------------------------
> Atul Saroha
> *Lead Software Engineer*
> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>  Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>
> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <ja...@gmail.com>
> wrote:
>
> Hi Patrick,
>
> The nodetool status shows all nodes up and normal now. From OpsCenter
> "Event Log" , there are some nodes reported as being down/up etc. during
> the timeframe of timeout, but these are Search workload nodes from the
> remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
>
> Thanks,
> Joseph
>
> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pm...@gmail.com>
> wrote:
>
> You aren't achieving quorum on your reads as the error is explains. That
> means you either have some nodes down or your topology is not matching up.
> The fact you are using LOCAL_QUORUM might point to a datacenter mis-match
> on node count + RF.
>
> What does your nodetool status look like?
>
> Patrick
>
> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <ja...@gmail.com>
> wrote:
>
> Hi,
>
> We recently started getting intermittent timeouts on primary key queries
> (select * from table where key=<key>)
>
> The error is : com.datastax.driver.core.excep tions.ReadTimeoutException:
> Cassandra timeout during read query at consistency LOCAL_QUORUM (2
> responses were required but only 1 replica
> a responded)
>
> The same query would work fine when tried directly from cqlsh. There are
> no indications in system.log for the table in question, though there were
> compactions in progress for tables in another keyspace which is more
> frequently accessed.
>
> My understanding is that the chances of primary key queries timing out is
> very minimal. Please share the possible reasons / ways to debug this issue.
>
> We are using Cassandra 2.1 (DSE 4.8.7).
>
> Thanks,
> Joseph
>


-- 
*Regards,*
*Anshu *

Re: Read timeouts on primary key queries

Posted by Romain Hardouin <ro...@yahoo.fr>.
Yes, dclocal_read_repair_chance will reduce the cross-DC traffic and latency, so you can swap the values ( https://issues.apache.org/jira/browse/CASSANDRA-7320 ). I guess the sstable_size_in_mb was set to 50 because back in the day (C* 1.0) the default size was way too small: 5 MB. So maybe someone in your company tried "10 * the default", i.e. 50 MB. Now the default is 160 MB. I'm not saying you should change the value, but keep in mind that you're using a small value here; knowing that could help you someday.
Regarding the cells, the histograms show an *estimation* of the min, p50, ..., p99, max of cells based on SSTables metadata. On your screenshot, the Max is 4768, so you have a partition key with ~ 4768 cells. The p99 is 1109, so 99% of your partition keys have less than (or equal to) 1109 cells. You can see this data for a given sstable with the sstablemetadata tool.
Best,
Romain
 

    On Monday, 5 September 2016 at 15:17, Joseph Tech <ja...@gmail.com> wrote:
 

 Thanks, Romain . We will try to enable the DEBUG logging (assuming it won't clog the logs much) . Regarding the table configs, read_repair_chance must be carried over from older versions - mostly defaults. I think sstable_size_in_mb was set to limit the max SSTable size, though i am not sure on the reason for the 50 MB value.
Does setting dclocal_read_repair_chance help in reducing cross-DC traffic (haven't looked into this parameter, just going by the name).

By the cell count definition : is it incremented based on the number of writes for a given name(key?) and value. This table is heavy on reads and writes. If so, the value should be much higher?
On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <ro...@yahoo.fr> wrote:

Hi,
Try to put org.apache.cassandra.db. ConsistencyLevel at DEBUG level, it could help to find a regular pattern. By the way, I see that you have set a global read repair chance:    read_repair_chance = 0.1And not the local read repair:    dclocal_read_repair_chance = 0.0 Is there any reason to do that or is it just the old (pre 2.0.9) default configuration? 
The cell count is the number of triplets: (name, value, timestamp)
Also, I see that you have set sstable_size_in_mb at 50 MB. What is the rational behind this? (Yes I'm curious :-) ). Anyway your "SSTables per read" are good.
Best,
Romain
    On Monday, 5 September 2016 at 13:32, Joseph Tech <ja...@gmail.com> wrote:
 

 Hi Ryan,
Attached are the cfhistograms run within few mins of each other. On the surface, don't see anything which indicates too much skewing (assuming skewing ==keys spread across many SSTables) . Please confirm. Related to this, what does the "cell count" metric indicate ; didn't find a clear explanation in the documents.
Thanks,Joseph

On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <rs...@foundev.pro> wrote:

 Have you looked at cfhistograms/tablehistograms your data maybe just skewed (most likely explanation is probably the correct one here)

Regard,
Ryan Svihla
 _____________________________
From: Joseph Tech <ja...@gmail.com>
Sent: Wednesday, August 31, 2016 11:16 PM
Subject: Re: Read timeouts on primary key queries
To: <us...@cassandra.apache.org>


Patrick,
The desc table is below (only col names changed) : 
CREATE TABLE db.tbl (    id1 text,    id2 text,    id3 text,    id4 text,    f1 text,    f2 map<text, text>,    f3 map<text, text>,    created timestamp,    updated timestamp,    PRIMARY KEY (id1, id2, id3, id4)) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)    AND bloom_filter_fp_chance = 0.01    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'    AND comment = ''    AND compaction = {'sstable_size_in_mb': '50', 'class': 'org.apache.cassandra.db. compaction. LeveledCompactionStrategy'}    AND compression = {'sstable_compression': 'org.apache.cassandra.io. compress.LZ4Compressor'}    AND dclocal_read_repair_chance = 0.0    AND default_time_to_live = 0    AND gc_grace_seconds = 864000    AND max_index_interval = 2048    AND memtable_flush_period_in_ms = 0    AND min_index_interval = 128    AND read_repair_chance = 0.1    AND speculative_retry = '99.0PERCENTILE';
and the query is select * from tbl where id1=? and id2=? and id3=? and id4=?
The timeouts happen within ~2s to ~5s, while the successful calls have avg of 8ms and p99 of 15s. These times are seen from app side, the actual query times would be slightly lower. 
Is there a way to capture traces only when queries take longer than a specified duration? . We can't enable tracing in production given the volume of traffic. We see that the same query which timed out works fine later, so not sure if the trace of a successful run would help.
Thanks,Joseph

On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pm...@gmail.com> wrote:

If you are getting a timeout on one table, then a mismatch of RF and node count doesn't seem as likely. 
Time to look at your query. You said it was a 'select * from table where key=?' type query. I would next use the trace facility in cqlsh to investigate further. That's a good way to find hard to find issues. You should be looking for clear ledge where you go from single digit ms to 4 or 5 digit ms times. 
The other place to look is your data model for that table if you want to post the output from a desc table.
Patrick


On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <ja...@gmail.com> wrote:

On further analysis, this issue happens only on 1 table in the KS which has the max reads. 
@Atul, I will look at system health, but didnt see anything standing out from GC logs. (using JDK 1.8_92 with G1GC). 
@Patrick , could you please elaborate the "mismatch on node count + RF" part.
On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <at...@snapdeal.com> wrote:

There could be many reasons for this if it is intermittent. CPU usage + I/O wait status. As read are I/O intensive, your IOPS requirement should be met that time load. Heap issue if CPU is busy for GC only. Network health could be the reason. So better to look system health during that time when it comes.

------------------------------ ------------------------------ ------------------------------ ---------------------------
Atul Saroha
Lead Software Engineer
M: +91 8447784271 T: +91 124-415-6069 EXT: 12369
Plot # 362, ASF Centre - Tower A, Udyog Vihar,
 Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <ja...@gmail.com> wrote:

Hi Patrick,
The nodetool status shows all nodes up and normal now. From OpsCenter "Event Log" , there are some nodes reported as being down/up etc. during the timeframe of timeout, but these are Search workload nodes from the remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
Thanks,Joseph
On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pm...@gmail.com> wrote:

You aren't achieving quorum on your reads as the error is explains. That means you either have some nodes down or your topology is not matching up. The fact you are using LOCAL_QUORUM might point to a datacenter mis-match on node count + RF. 
What does your nodetool status look like?
Patrick
On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <ja...@gmail.com> wrote:

Hi,
We recently started getting intermittent timeouts on primary key queries (select * from table where key=<key>)
The error is : com.datastax.driver.core.excep tions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 1 replica
a responded)
The same query would work fine when tried directly from cqlsh. There are no indications in system.log for the table in question, though there were compactions in progress for tables in another keyspace which is more frequently accessed. 
My understanding is that the chances of primary key queries timing out is very minimal. Please share the possible reasons / ways to debug this issue. 

We are using Cassandra 2.1 (DSE 4.8.7).
Thanks,Joseph



Re: Read timeouts on primary key queries

Posted by Joseph Tech <ja...@gmail.com>.
Thanks, Romain. We will try to enable the DEBUG logging (assuming it won't
clog the logs much). Regarding the table configs, read_repair_chance must
be carried over from older versions - mostly defaults. I think
sstable_size_in_mb was set to limit the max SSTable size, though I am not
sure of the reason for the 50 MB value.

Does setting dclocal_read_repair_chance help in reducing cross-DC traffic?
(I haven't looked into this parameter, just going by the name.)

By the cell count definition: is it incremented based on the number of
writes for a given name (key?) and value? This table is heavy on reads and
writes; if so, shouldn't the value be much higher?

On Mon, Sep 5, 2016 at 7:35 AM, Romain Hardouin <ro...@yahoo.fr> wrote:

> Hi,
>
> Try to put org.apache.cassandra.db.ConsistencyLevel at DEBUG level, it
> could help to find a regular pattern. By the way, I see that you have set a
> global read repair chance:
>     read_repair_chance = 0.1
> And not the local read repair:
>     dclocal_read_repair_chance = 0.0
> Is there any reason to do that or is it just the old (pre 2.0.9) default
> configuration?
>
> The cell count is the number of triplets: (name, value, timestamp)
>
> Also, I see that you have set sstable_size_in_mb at 50 MB. What is the
> rational behind this? (Yes I'm curious :-) ). Anyway your "SSTables per
> read" are good.
>
> Best,
>
> Romain
>
> On Monday, 5 September 2016 at 13:32, Joseph Tech <ja...@gmail.com>
> wrote:
>
>
> Hi Ryan,
>
> Attached are the cfhistograms run within few mins of each other. On the
> surface, don't see anything which indicates too much skewing (assuming
> skewing ==keys spread across many SSTables) . Please confirm. Related to
> this, what does the "cell count" metric indicate ; didn't find a clear
> explanation in the documents.
>
> Thanks,
> Joseph
>
>
> On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <rs...@foundev.pro> wrote:
>
> Have you looked at cfhistograms/tablehistograms your data maybe just
> skewed (most likely explanation is probably the correct one here)
>
> Regard,
>
> Ryan Svihla
>
> _____________________________
> From: Joseph Tech <ja...@gmail.com>
> Sent: Wednesday, August 31, 2016 11:16 PM
> Subject: Re: Read timeouts on primary key queries
> To: <us...@cassandra.apache.org>
>
>
>
> Patrick,
>
> The desc table is below (only col names changed) :
>
> CREATE TABLE db.tbl (
>     id1 text,
>     id2 text,
>     id3 text,
>     id4 text,
>     f1 text,
>     f2 map<text, text>,
>     f3 map<text, text>,
>     created timestamp,
>     updated timestamp,
>     PRIMARY KEY (id1, id2, id3, id4)
> ) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = ''
>     AND compaction = {'sstable_size_in_mb': '50', 'class':
> 'org.apache.cassandra.db. compaction. LeveledCompactionStrategy'}
>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.
> compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.0
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.1
>     AND speculative_retry = '99.0PERCENTILE';
>
> and the query is select * from tbl where id1=? and id2=? and id3=? and
> id4=?
>
> The timeouts happen within ~2s to ~5s, while the successful calls have avg
> of 8ms and p99 of 15s. These times are seen from app side, the actual query
> times would be slightly lower.
>
> Is there a way to capture traces only when queries take longer than a
> specified duration? . We can't enable tracing in production given the
> volume of traffic. We see that the same query which timed out works fine
> later, so not sure if the trace of a successful run would help.
>
> Thanks,
> Joseph
>
>
> On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pm...@gmail.com>
> wrote:
>
> If you are getting a timeout on one table, then a mismatch of RF and node
> count doesn't seem as likely.
>
> Time to look at your query. You said it was a 'select * from table where
> key=?' type query. I would next use the trace facility in cqlsh to
> investigate further. That's a good way to find hard to find issues. You
> should be looking for clear ledge where you go from single digit ms to 4 or
> 5 digit ms times.
>
> The other place to look is your data model for that table if you want to
> post the output from a desc table.
>
> Patrick
>
>
>
> On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <ja...@gmail.com>
> wrote:
>
> On further analysis, this issue happens only on 1 table in the KS which
> has the max reads.
>
> @Atul, I will look at system health, but didnt see anything standing out
> from GC logs. (using JDK 1.8_92 with G1GC).
>
> @Patrick , could you please elaborate the "mismatch on node count + RF"
> part.
>
> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <at...@snapdeal.com>
> wrote:
>
> There could be many reasons for this if it is intermittent. CPU usage +
> I/O wait status. As read are I/O intensive, your IOPS requirement should be
> met that time load. Heap issue if CPU is busy for GC only. Network health
> could be the reason. So better to look system health during that time when
> it comes.
>
> ------------------------------ ------------------------------
> ------------------------------ ---------------------------
> Atul Saroha
> *Lead Software Engineer*
> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>  Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>
> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <ja...@gmail.com>
> wrote:
>
> Hi Patrick,
>
> The nodetool status shows all nodes up and normal now. From OpsCenter
> "Event Log" , there are some nodes reported as being down/up etc. during
> the timeframe of timeout, but these are Search workload nodes from the
> remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
>
> Thanks,
> Joseph
>
> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pm...@gmail.com>
> wrote:
>
> You aren't achieving quorum on your reads as the error is explains. That
> means you either have some nodes down or your topology is not matching up.
> The fact you are using LOCAL_QUORUM might point to a datacenter mis-match
> on node count + RF.
>
> What does your nodetool status look like?
>
> Patrick
>
> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <ja...@gmail.com>
> wrote:
>
> Hi,
>
> We recently started getting intermittent timeouts on primary key queries
> (select * from table where key=<key>)
>
> The error is : com.datastax.driver.core.excep tions.ReadTimeoutException:
> Cassandra timeout during read query at consistency LOCAL_QUORUM (2
> responses were required but only 1 replica
> a responded)
>
> The same query would work fine when tried directly from cqlsh. There are
> no indications in system.log for the table in question, though there were
> compactions in progress for tables in another keyspace which is more
> frequently accessed.
>
> My understanding is that the chances of primary key queries timing out is
> very minimal. Please share the possible reasons / ways to debug this issue.
>
> We are using Cassandra 2.1 (DSE 4.8.7).
>
> Thanks,
> Joseph
>
>

Re: Read timeouts on primary key queries

Posted by Romain Hardouin <ro...@yahoo.fr>.
Hi,
Try to put org.apache.cassandra.db.ConsistencyLevel at DEBUG level; it could help to find a regular pattern. By the way, I see that you have set a global read repair chance:
    read_repair_chance = 0.1
and not the local read repair:
    dclocal_read_repair_chance = 0.0
Is there any reason to do that, or is it just the old (pre 2.0.9) default configuration?
The cell count is the number of triplets: (name, value, timestamp)
Also, I see that you have set sstable_size_in_mb at 50 MB. What is the rationale behind this? (Yes, I'm curious :-) ). Anyway, your "SSTables per read" numbers are good.
Best,
Romain
    On Monday, 5 September 2016 at 13:32, Joseph Tech <ja...@gmail.com> wrote:
 

 Hi Ryan,
Attached are the cfhistograms run within few mins of each other. On the surface, don't see anything which indicates too much skewing (assuming skewing ==keys spread across many SSTables) . Please confirm. Related to this, what does the "cell count" metric indicate ; didn't find a clear explanation in the documents.
Thanks,Joseph

On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <rs...@foundev.pro> wrote:

 Have you looked at cfhistograms/tablehistograms your data maybe just skewed (most likely explanation is probably the correct one here)

Regard,
Ryan Svihla
 _____________________________
From: Joseph Tech <ja...@gmail.com>
Sent: Wednesday, August 31, 2016 11:16 PM
Subject: Re: Read timeouts on primary key queries
To: <us...@cassandra.apache.org>


Patrick,
The desc table is below (only col names changed) : 
CREATE TABLE db.tbl (    id1 text,    id2 text,    id3 text,    id4 text,    f1 text,    f2 map<text, text>,    f3 map<text, text>,    created timestamp,    updated timestamp,    PRIMARY KEY (id1, id2, id3, id4)) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)    AND bloom_filter_fp_chance = 0.01    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'    AND comment = ''    AND compaction = {'sstable_size_in_mb': '50', 'class': 'org.apache.cassandra.db. compaction. LeveledCompactionStrategy'}    AND compression = {'sstable_compression': 'org.apache.cassandra.io. compress.LZ4Compressor'}    AND dclocal_read_repair_chance = 0.0    AND default_time_to_live = 0    AND gc_grace_seconds = 864000    AND max_index_interval = 2048    AND memtable_flush_period_in_ms = 0    AND min_index_interval = 128    AND read_repair_chance = 0.1    AND speculative_retry = '99.0PERCENTILE';
and the query is select * from tbl where id1=? and id2=? and id3=? and id4=?
The timeouts happen within ~2s to ~5s, while the successful calls have avg of 8ms and p99 of 15s. These times are seen from app side, the actual query times would be slightly lower. 
Is there a way to capture traces only when queries take longer than a specified duration? . We can't enable tracing in production given the volume of traffic. We see that the same query which timed out works fine later, so not sure if the trace of a successful run would help.
Thanks,Joseph

On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pm...@gmail.com> wrote:

If you are getting a timeout on one table, then a mismatch of RF and node count doesn't seem as likely. 
Time to look at your query. You said it was a 'select * from table where key=?' type query. I would next use the trace facility in cqlsh to investigate further. That's a good way to find hard to find issues. You should be looking for clear ledge where you go from single digit ms to 4 or 5 digit ms times. 
The other place to look is your data model for that table if you want to post the output from a desc table.
Patrick


On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <ja...@gmail.com> wrote:

On further analysis, this issue happens only on 1 table in the KS which has the max reads. 
@Atul, I will look at system health, but didnt see anything standing out from GC logs. (using JDK 1.8_92 with G1GC). 
@Patrick , could you please elaborate the "mismatch on node count + RF" part.
On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <at...@snapdeal.com> wrote:

There could be many reasons for this if it is intermittent. CPU usage + I/O wait status. As read are I/O intensive, your IOPS requirement should be met that time load. Heap issue if CPU is busy for GC only. Network health could be the reason. So better to look system health during that time when it comes.

------------------------------ ------------------------------ ------------------------------ ---------------------------
Atul Saroha
Lead Software Engineer
M: +91 8447784271 T: +91 124-415-6069 EXT: 12369
Plot # 362, ASF Centre - Tower A, Udyog Vihar,
 Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <ja...@gmail.com> wrote:

Hi Patrick,
The nodetool status shows all nodes up and normal now. From OpsCenter "Event Log" , there are some nodes reported as being down/up etc. during the timeframe of timeout, but these are Search workload nodes from the remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
Thanks,Joseph
On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pm...@gmail.com> wrote:

You aren't achieving quorum on your reads as the error is explains. That means you either have some nodes down or your topology is not matching up. The fact you are using LOCAL_QUORUM might point to a datacenter mis-match on node count + RF. 
What does your nodetool status look like?
Patrick
On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <ja...@gmail.com> wrote:

Hi,
We recently started getting intermittent timeouts on primary key queries (select * from table where key=<key>)
The error is : com.datastax.driver.core.excep tions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 1 replica
a responded)
The same query would work fine when tried directly from cqlsh. There are no indications in system.log for the table in question, though there were compactions in progress for tables in another keyspace which is more frequently accessed. 
My understanding is that the chances of primary key queries timing out is very minimal. Please share the possible reasons / ways to debug this issue. 

We are using Cassandra 2.1 (DSE 4.8.7).
Thanks,Joseph


Re: Read timeouts on primary key queries

Posted by Joseph Tech <ja...@gmail.com>.
Hi Ryan,

Attached are the cfhistograms run within a few minutes of each other. On the
surface, I don't see anything which indicates too much skewing (assuming
skewing == keys spread across many SSTables). Please confirm. Related to
this, what does the "cell count" metric indicate? I didn't find a clear
explanation in the documentation.

Thanks,
Joseph


On Thu, Sep 1, 2016 at 6:30 PM, Ryan Svihla <rs...@foundev.pro> wrote:

> Have you looked at cfhistograms/tablehistograms your data maybe just
> skewed (most likely explanation is probably the correct one here)
>
> Regard,
>
> Ryan Svihla
>
> _____________________________
> From: Joseph Tech <ja...@gmail.com>
> Sent: Wednesday, August 31, 2016 11:16 PM
> Subject: Re: Read timeouts on primary key queries
> To: <us...@cassandra.apache.org>
>
>
>
> Patrick,
>
> The desc table is below (only col names changed) :
>
> CREATE TABLE db.tbl (
>     id1 text,
>     id2 text,
>     id3 text,
>     id4 text,
>     f1 text,
>     f2 map<text, text>,
>     f3 map<text, text>,
>     created timestamp,
>     updated timestamp,
>     PRIMARY KEY (id1, id2, id3, id4)
> ) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = ''
>     AND compaction = {'sstable_size_in_mb': '50', 'class':
> 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.
> compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.0
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.1
>     AND speculative_retry = '99.0PERCENTILE';
>
> and the query is select * from tbl where id1=? and id2=? and id3=? and
> id4=?
>
> The timeouts happen within ~2s to ~5s, while the successful calls have avg
> of 8ms and p99 of 15s. These times are seen from app side, the actual query
> times would be slightly lower.
>
> Is there a way to capture traces only when queries take longer than a
> specified duration? . We can't enable tracing in production given the
> volume of traffic. We see that the same query which timed out works fine
> later, so not sure if the trace of a successful run would help.
>
> Thanks,
> Joseph
>
>
> On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pm...@gmail.com>
> wrote:
>
>> If you are getting a timeout on one table, then a mismatch of RF and node
>> count doesn't seem as likely.
>>
>> Time to look at your query. You said it was a 'select * from table where
>> key=?' type query. I would next use the trace facility in cqlsh to
>> investigate further. That's a good way to find hard to find issues. You
>> should be looking for clear ledge where you go from single digit ms to 4 or
>> 5 digit ms times.
>>
>> The other place to look is your data model for that table if you want to
>> post the output from a desc table.
>>
>> Patrick
>>
>>
>>
>> On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <ja...@gmail.com>
>> wrote:
>>
>>> On further analysis, this issue happens only on 1 table in the KS which
>>> has the max reads.
>>>
>>> @Atul, I will look at system health, but didnt see anything standing out
>>> from GC logs. (using JDK 1.8_92 with G1GC).
>>>
>>> @Patrick , could you please elaborate the "mismatch on node count + RF"
>>> part.
>>>
>>> On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <at...@snapdeal.com>
>>> wrote:
>>>
>>>> There could be many reasons for this if it is intermittent. CPU usage +
>>>> I/O wait status. As read are I/O intensive, your IOPS requirement should be
>>>> met that time load. Heap issue if CPU is busy for GC only. Network health
>>>> could be the reason. So better to look system health during that time when
>>>> it comes.
>>>>
>>>> ------------------------------------------------------------
>>>> ---------------------------------------------------------
>>>> Atul Saroha
>>>> *Lead Software Engineer*
>>>> *M*: +91 8447784271 *T*: +91 124-415-6069 *EXT*: 12369
>>>> Plot # 362, ASF Centre - Tower A, Udyog Vihar,
>>>>  Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
>>>>
>>>> On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <ja...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Patrick,
>>>>>
>>>>> The nodetool status shows all nodes up and normal now. From the OpsCenter
>>>>> "Event Log", there are some nodes reported as being down/up etc. during
>>>>> the timeframe of the timeouts, but these are Search workload nodes from
>>>>> the remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
>>>>>
>>>>> Thanks,
>>>>> Joseph
>>>>>
>>>>> On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pm...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> You aren't achieving quorum on your reads, as the error explains.
>>>>>> That means you either have some nodes down or your topology is not matching
>>>>>> up. The fact you are using LOCAL_QUORUM might point to a datacenter
>>>>>> mis-match on node count + RF.
>>>>>>
>>>>>> What does your nodetool status look like?
>>>>>>
>>>>>> Patrick
>>>>>>
>>>>>> On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <ja...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> We recently started getting intermittent timeouts on primary key
>>>>>>> queries (select * from table where key=<key>)
>>>>>>>
>>>>>>> The error is : com.datastax.driver.core.exceptions.ReadTimeoutException:
>>>>>>> Cassandra timeout during read query at consistency LOCAL_QUORUM (2
>>>>>>> responses were required but only 1 replica responded)
>>>>>>>
>>>>>>> The same query would work fine when tried directly from cqlsh. There
>>>>>>> are no indications in system.log for the table in question, though there
>>>>>>> were compactions in progress for tables in another keyspace which is more
>>>>>>> frequently accessed.
>>>>>>>
>>>>>>> My understanding is that the chances of primary key queries timing
>>>>>>> out are very low. Please share the possible reasons / ways to debug
>>>>>>> this issue.
>>>>>>>
>>>>>>> We are using Cassandra 2.1 (DSE 4.8.7).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Joseph
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
>

Re: Read timeouts on primary key queries

Posted by Ryan Svihla <rs...@foundev.pro>.
Have you looked at cfhistograms/tablehistograms? Your data may just be skewed (the most likely explanation is probably the correct one here).
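
A quick first check for skew (a sketch; db.tbl stands in for the real keyspace and table, and the exact field names in the output can vary a little between versions) is the per-table stats, where compacted partition maximum/mean bytes and maximum live cells per slice give a rough picture of how wide the partitions get:

    # rough check for unusually wide or skewed partitions on the affected table
    nodetool cfstats db.tbl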

Regards,
Ryan Svihla

		_____________________________
From: Joseph Tech <ja...@gmail.com>
Sent: Wednesday, August 31, 2016 11:16 PM
Subject: Re: Read timeouts on primary key queries
To:  <us...@cassandra.apache.org>


Patrick,
The desc table is below (only col names changed) : 
CREATE TABLE db.tbl (
    id1 text,
    id2 text,
    id3 text,
    id4 text,
    f1 text,
    f2 map<text, text>,
    f3 map<text, text>,
    created timestamp,
    updated timestamp,
    PRIMARY KEY (id1, id2, id3, id4)
) WITH CLUSTERING ORDER BY (id2 ASC, id3 ASC, id4 ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'sstable_size_in_mb': '50', 'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.1
    AND speculative_retry = '99.0PERCENTILE';
and the query is select * from tbl where id1=? and id2=? and id3=? and id4=?
The timeouts happen within ~2s to ~5s, while the successful calls have avg of 8ms and p99 of 15s. These times are seen from app side, the actual query times would be slightly lower. 
Is there a way to capture traces only when queries take longer than a specified duration? . We can't enable tracing in production given the volume of traffic. We see that the same query which timed out works fine later, so not sure if the trace of a successful run would help.
Thanks,
Joseph

On Wed, Aug 31, 2016 at 8:05 PM, Patrick McFadin <pm...@gmail.com> wrote:
If you are getting a timeout on one table, then a mismatch of RF and node count doesn't seem as likely. 
Time to look at your query. You said it was a 'select * from table where key=?' type query. I would next use the trace facility in cqlsh to investigate further. That's a good way to find hard to find issues. You should be looking for clear ledge where you go from single digit ms to 4 or 5 digit ms times. 
The other place to look is your data model for that table if you want to post the output from a desc table.
Patrick


On Tue, Aug 30, 2016 at 11:07 AM, Joseph Tech <ja...@gmail.com> wrote:
On further analysis, this issue happens only on 1 table in the KS which has the max reads. 
@Atul, I will look at system health, but didnt see anything standing out from GC logs. (using JDK 1.8_92 with G1GC). 
@Patrick , could you please elaborate the "mismatch on node count + RF" part.
On Tue, Aug 30, 2016 at 5:35 PM, Atul Saroha <at...@snapdeal.com> wrote:
There could be many reasons for this if it is intermittent. CPU usage + I/O wait status. As read are I/O intensive, your IOPS requirement should be met that time load. Heap issue if CPU is busy for GC only. Network health could be the reason. So better to look system health during that time when it comes.

---------------------------------------------------------------------------------------------------------------------
Atul Saroha
Lead Software Engineer
M: +91 8447784271 T: +91 124-415-6069 EXT: 12369
Plot # 362, ASF Centre - Tower A, Udyog Vihar,
 Phase -4, Sector 18, Gurgaon, Haryana 122016, INDIA
On Tue, Aug 30, 2016 at 5:10 PM, Joseph Tech <ja...@gmail.com> wrote:
Hi Patrick,
The nodetool status shows all nodes up and normal now. From the OpsCenter "Event Log", there are some nodes reported as being down/up etc. during the timeframe of the timeouts, but these are Search workload nodes from the remote (non-local) DC. The RF is 3 and there are 9 nodes per DC.
Thanks,
Joseph
On Mon, Aug 29, 2016 at 11:07 PM, Patrick McFadin <pm...@gmail.com> wrote:
You aren't achieving quorum on your reads, as the error explains. That means you either have some nodes down or your topology is not matching up. The fact you are using LOCAL_QUORUM might point to a datacenter mis-match on node count + RF.
What does your nodetool status look like?
Patrick
On Mon, Aug 29, 2016 at 10:14 AM, Joseph Tech <ja...@gmail.com> wrote:
Hi,
We recently started getting intermittent timeouts on primary key queries (select * from table where key=<key>)
The error is : com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout during read query at consistency LOCAL_QUORUM (2 responses were required but only 1 replica responded)
The same query would work fine when tried directly from cqlsh. There are no indications in system.log for the table in question, though there were compactions in progress for tables in another keyspace which is more frequently accessed. 
My understanding is that the chances of primary key queries timing out are very low. Please share the possible reasons / ways to debug this issue.

We are using Cassandra 2.1 (DSE 4.8.7).
Thanks,
Joseph