Posted to user@cassandra.apache.org by Ja Sam <pt...@gmail.com> on 2015/01/12 15:35:59 UTC

Permanent ReadTimeout

*Environment*


   - Cassandra 2.1.0
   - 5 nodes in one DC (DC_A), 4 nodes in second DC (DC_B)
   - 2500 writes per second; I write only to DC_A with local_quorum
   - minimal reads (usually none, sometimes a few)

*Problem*

After a few weeks of running I cannot read any data from my cluster,
because I get ReadTimeoutExceptions like the following:

ERROR [Thrift:15] 2015-01-07 14:16:21,124 CustomTThreadPoolServer.java:219 - Error occurred during processing of message.
com.google.common.util.concurrent.UncheckedExecutionException: java.lang.RuntimeException: org.apache.cassandra.exceptions.ReadTimeoutException: Operation timed out - received only 2 responses.

To be precise, this is not the only problem in my cluster. The second one was
described here: Cassandra GC takes 30 seconds and hangs node
<http://stackoverflow.com/questions/27843538/cassandra-gc-takes-30-seconds-and-hangs-node>
and I will try the fix from CASSANDRA-6541
<http://issues.apache.org/jira/browse/CASSANDRA-6541> as leshkin suggested.

*Diagnosis*

I tried some of the tools presented in
http://rustyrazorblade.com/2014/09/cassandra-summit-recap-diagnosing-problems-in-production/
by Jon Haddad and got some strange results.


I ran the same query in DC_A and DC_B with tracing enabled (see the cqlsh
sketch after the query). The query is simple:

   SELECT * FROM X.customer_events WHERE customer='1234567' AND
utc_day=16447 AND bucket IN (1,2,3,4,5,6,7,8,9,10);
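
By "tracing enabled" I mean per-session tracing, e.g. in cqlsh, roughly:

   -- in cqlsh, on a coordinator in the DC being tested
   TRACING ON;
   SELECT * FROM X.customer_events WHERE customer='1234567' AND
       utc_day=16447 AND bucket IN (1,2,3,4,5,6,7,8,9,10);
   TRACING OFF;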

The table is defined as follows:

  CREATE TABLE drev_maelstrom.customer_events (
      customer text, utc_day int, bucket int,
      event_time bigint, event_id blob, event_type int, event blob,
      PRIMARY KEY ((customer, utc_day, bucket), event_time, event_id, event_type)
  ) [...]

Results of the query:

1) In DC_B the query finished in less than 0.22 seconds. In DC_A it took more
than 2.5 seconds (~10 times longer). The added problem is that bucket can
range from -128 to 256.

2) In DC_B it checked ~1000 SSTables, with trace lines like:

   Bloom filter allows skipping sstable 50372 [SharedPool-Worker-7] |
2015-01-12 13:51:49.467001 | 192.168.71.198 |           4782

Whereas in DC_A it looks like:

   Bloom filter allows skipping sstable 118886 [SharedPool-Worker-5] |
2015-01-12 14:01:39.520001 | 192.168.61.199 |          25527

3) The total number of records in both DCs was the same.


*Question*

The question is quite simple: how can I speed up DC_A? It is my primary
DC, DC_B is mostly for backup, and there are a lot of network partitions
between A and B.

Maybe I should check something more, but I just don't know what that
should be.

Re: Permanent ReadTimeout

Posted by Ja Sam <pt...@gmail.com>.
Your response is full of information; after reading it I think I designed
something wrong in my system. I will try to present what hardware I have
and what I am trying to achieve.

*Hardware:*
I have 9 machines; every machine has 10 HDDs for data (not SSDs) and 64 GB
of RAM.

*Requirements*
The Cassandra storage is designed for audit data, so the only operation is
INSERT.
Each event has the following properties: customer, UUID, event type (there
are 4 types), date-time and some other properties. The event is stored as a
protobuf in a blob.
There are two types of customers which generate events: customers with a
small daily volume (up to 100 events) and customers with lots of events
daily (up to 100 thousand). But from the customer id alone I don't know
which type it is.

There are two types of queries which I need to run (both are sketched right
after this list):
1) Select all events for a customer in a date range. The range is small - up
to a few days. This is the "audit" query.
2) Select all event UUIDs for one day - this is for the reconciliation
process; we need to check that every event was stored in Cassandra.
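
Roughly, the audit query (1) maps onto the customer_events table from my
earlier mails like this (example values; the date range is covered one day
per query, since as far as I know only the last partition-key column can
take IN here):

   SELECT * FROM drev_maelstrom.customer_events
    WHERE customer = '1234567'
      AND utc_day = 16447
      AND bucket IN (1,2,3,4,5,6,7,8,9,10);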

*Keyspaces*
Each day I write into two keyspaces:
1) One for storing data for the audit query. The table definition was
presented in previous mails.
2) One for reconciliation only - it is a one-day keyspace. After
reconciliation I can safely delete it (a sketch of such a table follows).
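
To make (2) concrete, the per-day reconciliation keyspace holds something
like the following - only an illustration, the names are made up, not the
real definition:

   CREATE TABLE recon_20150112.event_ids (
       utc_day  int,
       bucket   int,
       event_id blob,
       PRIMARY KEY ((utc_day, bucket), event_id)
   );

   -- reconciliation query (2): all event ids for one day, bucket by bucket
   SELECT event_id FROM recon_20150112.event_ids
    WHERE utc_day = 16447 AND bucket = 1;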


*Data replication*
We have set the following replication settings:
    REPLICATION = {'class' : 'NetworkTopologyStrategy', 'DC_A' : 5, 'DC_B' : 3};
which means that all machines in DC_A have all of the data, while in DC_B
each machine has 3/4 of the data.
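
For completeness, the full statement is the usual CREATE/ALTER KEYSPACE
form, e.g. for the audit keyspace:

   CREATE KEYSPACE drev_maelstrom
     WITH REPLICATION = {'class' : 'NetworkTopologyStrategy',
                         'DC_A' : 5, 'DC_B' : 3};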

*Disk usage*
When I checked disk usage, not all disks have the same usage and used space.

*Questions*
1) Is there a way to utilize the HDDs better?
2) Maybe I should write to multiple keyspaces to get better HDD utilization?
3) Are my replication settings correct? Or maybe they are too big?
4) I can easily reduce write operations just by removing the reconciliation
keyspace, but I will still have to find a way to query all event UUIDs for
one day.

I hope I presented enough information; if something is missing, just write.
Thanks again for the help.



On Tue, Jan 13, 2015 at 5:35 PM, Eric Stevens <mi...@gmail.com> wrote:


Re: Permanent ReadTimeout

Posted by Eric Stevens <mi...@gmail.com>.
If you have fallen far behind on compaction, this is a hard situation to
recover from.  It means that you're writing data faster than your cluster
can absorb it.  The right path forward depends on a lot of factors, but in
general you either need more servers or bigger servers, or else you need to
write less data.

Safely adding servers is actually hard in this situation: lots of
aggressive compaction produces a result where bootstrapping new nodes
(growing your cluster) causes a lot of over-streaming, meaning data that is
getting compacted may be streamed multiple times, in the old SSTable, and
again in the new post-compaction SSTable, and maybe again in another
post-compaction SSTable.  For a healthy cluster, it's a trivial amount of
overstreaming.  For an unhealthy cluster like this, you might not actually
ever complete streaming and be able to successfully join the cluster before
your target server's disks are full.

If you can afford the space and don't already have things set up this way,
disable compression and switch to size tiered compaction (you'll need to
keep at least 50% of your disk space free to be safe in size tiered).  Also
nodetool setcompactionthroughput will let you open the flood gates to try
to catch up on compaction quickly (at the cost of read and write
performance into the cluster).
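
Concretely, that would be something along these lines (a sketch only - the
table name is taken from your earlier mail, and please double-check the
2.1 option names):

   -- switch the hot table to size-tiered compaction and turn off compression
   ALTER TABLE drev_maelstrom.customer_events
     WITH compaction  = { 'class' : 'SizeTieredCompactionStrategy' }
      AND compression = { 'sstable_compression' : '' };

   # then open the compaction throttle on each node (0 = unthrottled) ...
   nodetool setcompactionthroughput 0
   # ... and set it back later (16 MB/s is the 2.1 default)
   nodetool setcompactionthroughput 16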

If you still can't catch up on compaction, you have a very serious
problem.  You need to either reduce your write volume, or grow your cluster
unsafely (disable bootstrapping new nodes) to reduce write pressure on your
existing nodes.  Either way you should get caught up on compaction before
you can safely add new nodes again.
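
"Disable bootstrapping" here normally means starting the new node with
auto_bootstrap turned off (a sketch; skipping streaming entirely is exactly
what makes this unsafe):

   # cassandra.yaml on the new node
   # the node joins the ring immediately and does NOT stream the existing
   # data for the token ranges it takes over
   auto_bootstrap: false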

If you grow unsafely, you are effectively electing to discard data.  Some
of it may be recoverable with a nodetool repair after you're caught up on
compaction, but you will almost certainly lose some records.

On Tue, Jan 13, 2015 at 2:22 AM, Ja Sam <pt...@gmail.com> wrote:


Re: Permanent ReadTimeout

Posted by Ja Sam <pt...@gmail.com>.
Ad 4) I definitely have a big problem, because I see pending tasks: 3094

The question is: what should I change/monitor? I can present my whole
solution design if it helps.
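
(That figure comes from checking compaction per node:)

   # the first line of output reports "pending tasks: <n>"
   nodetool compactionstats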

On Mon, Jan 12, 2015 at 8:32 PM, Ja Sam <pt...@gmail.com> wrote:


Re: Permanent ReadTimeout

Posted by Ja Sam <pt...@gmail.com>.
To address your remarks point by point:

1) About the 30-second GCs: I know that over time my cluster developed this
problem. We added the "magic" flag, but the result will only be visible in
~2 weeks (as I showed in the screenshot on StackOverflow). If you have any
idea how I can fix/diagnose this problem, I will be very grateful.

2) It is probably true, but I don't think I can change it. Our data centers
are in different places and the network between them is not perfect. But as
far as we have observed, network partitions happen rarely - at most once a
week, for about an hour.

3) We are trying to do regular repairs (incremental), but usually they do
not finish. Even local repairs have problems finishing (the kind of
invocation we run is sketched after this list).

4) I will check it as soon as possible and post the results here. If you
have any suggestions about what else I should check, you are welcome :)
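
Re point 3, the repairs are started more or less like this (a sketch - the
keyspace name is just the audit one, and the flags are the 2.1 nodetool
ones as I understand them):

   # incremental, parallel repair of the audit keyspace, run node by node
   nodetool repair -par -inc drev_maelstrom
   # "local" repair restricted to the local datacenter
   nodetool repair -local -par -inc drev_maelstrom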




On Mon, Jan 12, 2015 at 7:28 PM, Eric Stevens <mi...@gmail.com> wrote:


Re: Permanent ReadTimeout

Posted by Eric Stevens <mi...@gmail.com>.
If you're getting 30 second GC's, this all by itself could and probably
does explain the problem.

If you're writing exclusively to A, and there are frequent partitions
between A and B, then A is potentially working a lot harder than B, because
it needs to keep track of hinted handoffs to replay to B whenever
connectivity is restored.  It's also acting as coordinator for writes which
need to end up in B eventually.  This in turn may be a significant
contributing factor to your GC pressure in A.
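
If I remember right, in 2.1 stored hints live in the system.hints table, so
a rough way to see whether a DC_A node is sitting on a hint backlog is:

   -- run against the node directly; this may itself be slow or time out
   -- if the backlog is huge, so treat it only as a rough indicator
   SELECT count(*) FROM system.hints;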

I'd also grow suspicious of the integrity of B as a reliable backup of A
unless you're running repair on a regular basis.

Also, if you have thousands of SSTables, then you're probably falling
behind on compaction; check nodetool compactionstats - you should typically
have < 5 outstanding tasks (preferably 0-1).  If you're not behind on
compaction, your sstable_size_in_mb might be a bad value for your use case.
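
For the record, sstable_size_in_mb is a LeveledCompactionStrategy option, so
adjusting it would look roughly like this (a sketch; 160 MB is just an
example value, and the table name is from your earlier mail):

   -- only relevant if the table uses LeveledCompactionStrategy
   ALTER TABLE drev_maelstrom.customer_events
     WITH compaction = { 'class' : 'LeveledCompactionStrategy',
                         'sstable_size_in_mb' : 160 };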

On Mon, Jan 12, 2015 at 7:35 AM, Ja Sam <pt...@gmail.com> wrote:
