Posted to user@cassandra.apache.org by Riccardo Ferrari <fe...@gmail.com> on 2018/09/12 09:59:13 UTC

Read timeouts when performing rolling restart

Hi list,

We are seeing the following behaviour when performing a rolling restart:

On the node I need to restart (the full per-node sequence is sketched below):
* I run 'nodetool drain'
* Then 'service cassandra restart'

So far so good. The load increase on the other 5 nodes is negligible.
The node is generally out of service just for the time of the restart (i.e.
a cassandra.yaml update).
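
For reference, the per-node sequence is roughly the following (a minimal
sketch; the service name and the wait-for-health check are assumptions that
depend on packaging and environment):

    # On each node, one at a time:
    nodetool drain               # stop accepting writes, flush memtables
    service cassandra restart    # restart the daemon (picks up cassandra.yaml changes)
    # Wait until the node is back and marked Up/Normal before moving on:
    until nodetool status | grep -q "^UN  $(hostname -i)"; do sleep 10; done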

When the node comes back up and switches the native transport back on, I
start seeing lots of read timeouts in our various services:

com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra timeout
during read query at consistency LOCAL_ONE (1 responses were required but
only 0 replica responded)

Indeed the restarting node has a huge peak in system load because of hints
and compactions; nevertheless, I don't notice a load increase on the other 5
nodes.

Specs:
- 6-node cluster on Cassandra 3.0.6
- keyspace RF=3

Java driver 3.5.1:
- DefaultRetryPolicy
- default LoadBalancingPolicy (that should be DCAwareRoundRobinPolicy)

QUESTIONS:
How come a single node is impacting the whole cluster?
Is there a way to further delay the native transport startup? (See the sketch
below.)
Any hints on troubleshooting this further?
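
(One workaround that comes up later in the thread: start the node with the
native transport disabled and only enable it once the node has settled. A
minimal sketch:)

    # In cassandra.yaml on the node being restarted:
    #   start_native_transport: false
    service cassandra restart
    # ... let hints and compactions settle (watch 'nodetool tpstats') ...
    nodetool enablebinary      # open the native transport to clients again
    nodetool statusbinary      # should report "running"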

Thanks

Re: Read timeouts when performing rolling restart

Posted by Riccardo Ferrari <fe...@gmail.com>.
Alright, coming back after some homework. Thank you Alain!

# About hardware:
- m1.xlarge: 4 vCPUs, 15GB RAM
- 4 spinning disks configured in RAID0

# About compactors:
- I've moved them back to 2 concurrent compactors. In general I don't see
more than 80-ish pending compactions (during compaction times). This is true
when watching 'nodetool tpstats' too (see the sketch below).
- Throughput is 16MB/s
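
For reference, the compaction knobs can be inspected and adjusted at runtime
roughly like this (a sketch; 16 simply mirrors the current setting):

    nodetool compactionstats             # pending compactions and running tasks
    nodetool getcompactionthroughput     # current throughput cap
    nodetool setcompactionthroughput 16  # MB/s, takes effect without a restart
    # cassandra.yaml equivalents: concurrent_compactors, compaction_throughput_mb_per_sec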

# About hints:
When a node boots up I clearly see a spike in pending compactions (still
around 80-ish). During boot, when it starts receiving hints, the system load
grows to an unsustainable level (30+) and in the logs I get the message
"[HINTS|MUTATION] messages were dropped in the last 5000ms...".
Now, after tuning the compactors I still see some dropped messages (all of
them MUTATION or HINTS). On some nodes the count is as low as 0, on others as
high as 32k. In particular, out of 6 nodes there is one dropping 32k
messages, one dropping 16k and one a few hundred, and somehow they are always
the same nodes.
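
To see this while a node is coming up, and to slow hint delivery down if it
is the culprit, something like the following should work (a sketch; 512 is an
arbitrary example value):

    watch -n1 -d 'nodetool tpstats | tail -n 20'   # dropped MUTATION/HINT counters are at the end
    nodetool statushandoff                         # is hinted handoff currently enabled?
    nodetool sethintedhandoffthrottlekb 512        # throttle hint delivery (KB/s)
    nodetool disablehandoff                        # or turn hints off entirely for a test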

# About GC:
I have moved all my nodes to CMS: Xms and Xmx 8G, and Xmn 4G. You already
helped with the JVM tuning. Although G1 was doing pretty well, CMS turned out
to be more consistent; under heavy load G1 can stop longer than CMS. GC
pauses as seen on a couple of nodes are between 200 and 430ms.
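
For the record, roughly the JVM settings in use, plus a quick way to eyeball
pauses (a sketch; where the flags live, cassandra-env.sh vs jvm.options, and
the log path depend on the packaging):

    # Heap / collector flags:
    #   -Xms8G -Xmx8G -Xmn4G
    #   -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
    # GC pauses as reported by Cassandra itself:
    grep -i "GCInspector" /var/log/cassandra/system.log | tail -n 20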


# Couple of notes:
I see some nodes with higher system load (not data) than others; in
particular, one node, despite having 170+GB of data, barely and rarely
reaches a system load of 2.0.
I recently adopted Reaper (thanks TLP!). Of my 2 keyspaces, one is running
(repairing) just fine; the second, which is bigger/older, is simply not
progressing. Maybe this can give a hint on where to look...

Thanks!

On Fri, Sep 14, 2018 at 11:54 AM, Alain RODRIGUEZ <ar...@gmail.com>
wrote:

> Hello Riccardo.
>
> Going to a VPC, using GPFS and NTS all sounds very reasonable to me. As you
> said, that's another story. Good luck with this. Yet the current topology
> should also work well and I am wondering why the query does not find any
> other replica available.
>
> About your problem at hand:
>
> It's unclear to me at this point if the nodes are becoming unresponsive.
> My main thought on your first email was that you were facing some issue
> where, due to the topology or to the client configuration, you were missing
> replicas, but I cannot see what's wrong (if not authentication indeed, but
> you don't use it).
> Then I am thinking it might indeed be due to many nodes getting extremely
> busy at the moment of the restart (of any of the nodes), because of this:
>
> After raising the compactors to 4 I still see some dropped messages for
>> HINT and MUTATIONS. This happens during startup. The reason is "for internal
>> timeout". Maybe too many compactors?
>
>
> Some tuning information/hints:
>
> * The number of *concurrent_compactors* should be between 1/4 and 1/2 of
> the total number of cores and generally no more than 8. It should ideally
> never be equal to the number of CPU cores, as we want power available to
> process requests at any moment.
> * Another common bottleneck is the disk throughput. If compactions are
> running too fast, it can harm as well. I would fix the number of
> concurrent_compactors as mentioned above and act on the compaction
> throughput instead.
> * If hints are a problem, or rather, to make sure they are involved
> in the issue you see, why not disable hints completely on all nodes and
> try a restart? Anything that can be disabled is an optimization. You do not
> need hinted handoff if you run a repair later on (or if you operate with a
> strong consistency and do not perform deletes, for example). You can give
> this a try:
> https://github.com/apache/cassandra/blob/cassandra-3.0.6/conf/cassandra.yaml#L44-L46
> * Not as brutal, you can try slowing down the hint transfer speed:
> https://github.com/apache/cassandra/blob/cassandra-3.0.6/conf/cassandra.yaml#L57-L67
> * Check for GC that would be induced by the pressure put by hints
> delivery, compactions and the first load of the memory on machine start.
> Any GC activity that would be shown in the logs?
> * As you are using AWS, tuning the phi_convict_threshold to 10-12 could
> help as well in not marking the node down (if that's what happens).
> * Do you see any specific part of the hardware being the bottleneck or
> being especially heavily used during a restart? Maybe use:
> 'dstat -D <disk> -lvrn 10' (where <disk> is like 'xvdq'). I believe this
> command shows bytes, not bits, thus '50M' is 50 MB or 400 Mb.
> * What hardware are you using?
> * Could you also run 'watch -n1 -d "nodetool tpstats"' during the node
> restarts as well and see which threads are 'PENDING' during the restart. For
> instance, if the flush writer is pending, the next write to that table has
> to wait for the data to be flushed. It can be multiple things, but having
> an interactive view of the pending requests might lead you to the root
> cause of the issue.
>
> C*heers,
> -----------------------
> Alain Rodriguez - @arodream - alain@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On Thu, Sep 13, 2018 at 09:50, Riccardo Ferrari <fe...@gmail.com>
> wrote:
>
>> Hi Shalom,
>>
>> It happens almost at every restart, either a single node or a rolling
>> one. I do agree with you that it is good, at least on my setup, to wait a
>> few minutes to let the rebooted node cool down before moving to the next.
>> The more I look at it, the more I think it is something coming from hint
>> dispatching; maybe I should try something around hint throttling.
>>
>> Thanks!
>>
>> On Thu, Sep 13, 2018 at 8:55 AM, shalom sagges <sh...@gmail.com>
>> wrote:
>>
>>> Hi Riccardo,
>>>
>>> Does this issue occur when performing a single restart or after several
>>> restarts during a rolling restart (as mentioned in your original post)?
>>> We have a cluster where, when performing a rolling restart, we prefer to
>>> wait ~10-15 minutes between each restart because we see an increase in GC
>>> for a few minutes.
>>> If we keep restarting the nodes quickly one after the other, the
>>> applications experience timeouts (probably due to GC and hints).
>>>
>>> Hope this helps!
>>>
>>> On Thu, Sep 13, 2018 at 2:20 AM Riccardo Ferrari <fe...@gmail.com>
>>> wrote:
>>>
>>>> A little update on the progress.
>>>>
>>>> First:
>>>> Thank you Thomas. I checked the code in the patch and briefly skimmed
>>>> through the 3.0.6 code. Yup it should be fixed.
>>>> Thank you Surbhi. At the moment we don't need authentication as the
>>>> instances are locked down.
>>>>
>>>> Now:
>>>> - Unfortunately the start_native_transport trick does not always work:
>>>> on some nodes it works, on others it doesn't. What do I mean? I still
>>>> experience timeouts and dropped messages during startup.
>>>> - I realized that cutting the concurrent_compactors to 1 was not really
>>>> a good idea; the minimum value should be 2, currently testing 4 (that is
>>>> min(n_cores, n_disks)).
>>>> - After raising the compactors to 4 I still see some dropped messages
>>>> for HINT and MUTATIONS. This happens during startup. The reason is "for
>>>> internal timeout". Maybe too many compactors?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>> On Wed, Sep 12, 2018 at 7:09 PM, Surbhi Gupta <surbhi.gupta01@gmail.com
>>>> > wrote:
>>>>
>>>>> Another thing to notice is:
>>>>>
>>>>> system_auth WITH replication = {'class': 'SimpleStrategy',
>>>>> 'replication_factor': '1'}
>>>>>
>>>>> system_auth has a replication factor of 1, so even a single node being
>>>>> down may impact the system.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, 12 Sep 2018 at 09:46, Steinmaurer, Thomas <
>>>>> thomas.steinmaurer@dynatrace.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I remember something about a client using the native protocol getting
>>>>>> notified too early that Cassandra is ready, due to the following issue:
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/CASSANDRA-8236
>>>>>>
>>>>>>
>>>>>>
>>>>>> which looks similar, but above was marked as fixed in 2.2.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thomas
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Riccardo Ferrari <fe...@gmail.com>
>>>>>> *Sent:* Wednesday, 12 September 2018 18:25
>>>>>> *To:* user@cassandra.apache.org
>>>>>> *Subject:* Re: Read timeouts when performing rolling restart
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi Alain,
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thank you for chiming in!
>>>>>>
>>>>>>
>>>>>>
>>>>>> I was thinking of performing the 'start_native_transport=false' test as
>>>>>> well, and indeed the issue is not showing up. Starting the/a node with
>>>>>> native transport disabled and letting it cool down leads to no timeout
>>>>>> exceptions and no dropped messages, simply a crystal-clean startup. Agreed,
>>>>>> it is a workaround.
>>>>>>
>>>>>>
>>>>>>
>>>>>> # About upgrading:
>>>>>>
>>>>>> Yes, I desperately want to upgrade, despite it being a long and slow task.
>>>>>> Just reviewing all the changes from 3.0.6 to 3.0.17
>>>>>> is going to be a huge pain. Off the top of your head, any breaking changes I
>>>>>> should absolutely take care of reviewing?
>>>>>>
>>>>>>
>>>>>>
>>>>>> # describecluster output: YES they agree on the same schema version
>>>>>>
>>>>>>
>>>>>>
>>>>>> # keyspaces:
>>>>>>
>>>>>> system WITH replication = {'class': 'LocalStrategy'}
>>>>>>
>>>>>> system_schema WITH replication = {'class': 'LocalStrategy'}
>>>>>>
>>>>>> system_auth WITH replication = {'class': 'SimpleStrategy',
>>>>>> 'replication_factor': '1'}
>>>>>>
>>>>>> system_distributed WITH replication = {'class': 'SimpleStrategy',
>>>>>> 'replication_factor': '3'}
>>>>>>
>>>>>> system_traces WITH replication = {'class': 'SimpleStrategy',
>>>>>> 'replication_factor': '2'}
>>>>>>
>>>>>>
>>>>>>
>>>>>> <custom1> WITH replication = {'class': 'SimpleStrategy',
>>>>>> 'replication_factor': '3'}
>>>>>>
>>>>>> <custom2>  WITH replication = {'class': 'SimpleStrategy',
>>>>>> 'replication_factor': '3'}
>>>>>>
>>>>>>
>>>>>>
>>>>>> # Snitch
>>>>>>
>>>>>> Ec2Snitch
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## About Snitch and replication:
>>>>>>
>>>>>> - We have the default DC and all nodes are in the same RACK
>>>>>>
>>>>>> - We are planning to move to GossipingPropertyFileSnitch, configuring
>>>>>> the cassandra-rackdc.properties accordingly.
>>>>>>
>>>>>> -- This should be a transparent change, correct?
>>>>>>
>>>>>>
>>>>>>
>>>>>> - Once switched to GPFS, we plan to move to 'NetworkTopologyStrategy'
>>>>>> with 'us-xxxx' DC and replica counts as before
>>>>>>
>>>>>> - Then adding a new DC inside the VPC, but this is another story...
>>>>>>
>>>>>>
>>>>>>
>>>>>> Any concerns here ?
>>>>>>
>>>>>>
>>>>>>
>>>>>> # nodetool status <ks>
>>>>>>
>>>>>> --  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
>>>>>> UN  10.x.x.a   177 GB     256     50.3%             d8bfe4ad-8138-41fe-89a4-ee9a043095b5  rr
>>>>>> UN  10.x.x.b   152.46 GB  256     51.8%             7888c077-346b-4e09-96b0-9f6376b8594f  rr
>>>>>> UN  10.x.x.c   159.59 GB  256     49.0%             329b288e-c5b5-4b55-b75e-fbe9243e75fa  rr
>>>>>> UN  10.x.x.d   162.44 GB  256     49.3%             07038c11-d200-46a0-9f6a-6e2465580fb1  rr
>>>>>> UN  10.x.x.e   174.9 GB   256     50.5%             c35b5d51-2d14-4334-9ffc-726f9dd8a214  rr
>>>>>> UN  10.x.x.f   194.71 GB  256     49.2%             f20f7a87-d5d2-4f38-a963-21e24167b8ac  rr
>>>>>>
>>>>>>
>>>>>>
>>>>>> # gossipinfo
>>>>>>
>>>>>> /10.x.x.a
>>>>>>   STATUS:827:NORMAL,-1350078789194251746
>>>>>>   LOAD:289986:1.90078037902E11
>>>>>>   SCHEMA:281088:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>>   DC:6:<some-ec2-dc>
>>>>>>   RACK:8:rr
>>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>>   SEVERITY:290040:0.5934718251228333
>>>>>>   NET_VERSION:1:10
>>>>>>   HOST_ID:2:d8bfe4ad-8138-41fe-89a4-ee9a043095b5
>>>>>>   RPC_READY:868:true
>>>>>>   TOKENS:826:<hidden>
>>>>>> /10.x.x.b
>>>>>>   STATUS:16:NORMAL,-1023229528754013265
>>>>>>   LOAD:7113:1.63730480619E11
>>>>>>   SCHEMA:10:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>>   DC:6:<some-ec2-dc>
>>>>>>   RACK:8:rr
>>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>>   SEVERITY:7274:0.5988024473190308
>>>>>>   NET_VERSION:1:10
>>>>>>   HOST_ID:2:7888c077-346b-4e09-96b0-9f6376b8594f
>>>>>>   TOKENS:15:<hidden>
>>>>>> /10.x.x.c
>>>>>>   STATUS:732:NORMAL,-1117172759238888547
>>>>>>   LOAD:245839:1.71409806942E11
>>>>>>   SCHEMA:237168:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>>   DC:6:<some-ec2-dc>
>>>>>>   RACK:8:rr
>>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>>   SEVERITY:245989:0.0
>>>>>>   NET_VERSION:1:10
>>>>>>   HOST_ID:2:329b288e-c5b5-4b55-b75e-fbe9243e75fa
>>>>>>   RPC_READY:763:true
>>>>>>   TOKENS:731:<hidden>
>>>>>> /10.x.x.d
>>>>>>   STATUS:14:NORMAL,-1004942496246544417
>>>>>>   LOAD:313125:1.74447964917E11
>>>>>>   SCHEMA:304268:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>>   DC:6:<some-ec2-dc>
>>>>>>   RACK:8:rr
>>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>>   SEVERITY:313215:0.25641027092933655
>>>>>>   NET_VERSION:1:10
>>>>>>   HOST_ID:2:07038c11-d200-46a0-9f6a-6e2465580fb1
>>>>>>   RPC_READY:56:true
>>>>>>   TOKENS:13:<hidden>
>>>>>> /10.x.x.e
>>>>>>   STATUS:520:NORMAL,-1058809960483771749
>>>>>>   LOAD:276118:1.87831573032E11
>>>>>>   SCHEMA:267327:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>>   DC:6:<some-ec2-dc>
>>>>>>   RACK:8:rr
>>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>>   SEVERITY:276217:0.32786884903907776
>>>>>>   NET_VERSION:1:10
>>>>>>   HOST_ID:2:c35b5d51-2d14-4334-9ffc-726f9dd8a214
>>>>>>   RPC_READY:550:true
>>>>>>   TOKENS:519:<hidden>
>>>>>> /10.x.x.f
>>>>>>   STATUS:1081:NORMAL,-1039671799603495012
>>>>>>   LOAD:239114:2.09082017545E11
>>>>>>   SCHEMA:230229:af4461c3-d269-39bc-9d03-3566031c1e0a
>>>>>>   DC:6:<some-ec2-dc>
>>>>>>   RACK:8:rr
>>>>>>   RELEASE_VERSION:4:3.0.6
>>>>>>   SEVERITY:239180:0.5665722489356995
>>>>>>   NET_VERSION:1:10
>>>>>>   HOST_ID:2:f20f7a87-d5d2-4f38-a963-21e24167b8ac
>>>>>>   RPC_READY:1118:true
>>>>>>   TOKENS:1080:<hidden>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## About load and tokens:
>>>>>>
>>>>>> - While load is pretty even, this does not apply to tokens; I guess we
>>>>>> have some tables with uneven distribution. This should not be the case for
>>>>>> high-load tables, as partition keys are built with some 'id + <some time
>>>>>> format>'
>>>>>>
>>>>>> - I was not able to find any documentation about the numbers printed
>>>>>> next to LOAD, SCHEMA, SEVERITY, RPC_READY... Is there any doc around?
>>>>>>
>>>>>>
>>>>>>
>>>>>> # Tombstones
>>>>>>
>>>>>> No ERRORS, only WARNs about a very specific table that we are aware
>>>>>> of. It is an append-only table read by Spark from a batch job. (I guess it
>>>>>> is a read_repair_chance or DTCS misconfiguration.)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## Closing note!
>>>>>>
>>>>>> We are on very old m1.xlarge instances: 4 vCPUs and RAID0 (striping) over
>>>>>> the 4 spinning drives, with some changes to cassandra.yaml:
>>>>>>
>>>>>>
>>>>>>
>>>>>> - dynamic_snitch: false
>>>>>>
>>>>>> - concurrent_reads: 48
>>>>>>
>>>>>> - concurrent_compactors: 1 (was 2)
>>>>>>
>>>>>> - disk_optimization_strategy: spinning
>>>>>>
>>>>>>
>>>>>>
>>>>>> I have some concerns about the number of concurrent_compactors, what
>>>>>> do you think?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>
>>>>
>>

Re: Read timeouts when performing rolling restart

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hello Riccardo.

Going to a VPC, using GPFS and NTS all sounds very reasonable to me. As you
said, that's another story. Good luck with this. Yet the current topology
should also work well and I am wondering why the query does not find any
other replica available.

About your problem at hand:

It's unclear to me at this point if the nodes are becoming unresponsive. My
main thought on your first email was that you were facing some issue where,
due to the topology or to the client configuration, you were missing
replicas, but I cannot see what's wrong (if not authentication indeed, but
you don't use it).
Then I am thinking it might indeed be due to many nodes getting extremely
busy at the moment of the restart (of any of the nodes), because of this:

After raising the compactors to 4 I still see some dropped messages for HINT
> and MUTATIONS. This happens during startup. The reason is "for internal
> timeout". Maybe too many compactors?


Some tuning information/hints:

* The number of *concurrent_compactors* should be between 1/4 and 1/2 of the
total number of cores and generally no more than 8. It should ideally never
be equal to the number of CPU cores, as we want power available to process
requests at any moment.
* Another common bottleneck is the disk throughput. If compactions are
running too fast, it can harm as well. I would fix the number of
concurrent_compactors as mentioned above and act on the compaction
throughput instead.
* If hints are a problem, or rather, to make sure they are involved
in the issue you see, why not disable hints completely on all nodes and
try a restart? Anything that can be disabled is an optimization. You do not
need hinted handoff if you run a repair later on (or if you operate with a
strong consistency and do not perform deletes, for example). You can give
this a try (see the sketch after this list):
https://github.com/apache/cassandra/blob/cassandra-3.0.6/conf/cassandra.yaml#L44-L46
* Not as brutal, you can try slowing down the hint transfer speed:
https://github.com/apache/cassandra/blob/cassandra-3.0.6/conf/cassandra.yaml#L57-L67
* Check for GC that would be induced by the pressure put by hints delivery,
compactions and the first load of the memory on machine start. Any GC
activity that would be shown in the logs?
* As you are using AWS, tuning the phi_convict_threshold to 10-12 could
help as well in not marking the node down (if that's what happens).
* Do you see any specific part of the hardware being the bottleneck or being
especially heavily used during a restart? Maybe use:
'dstat -D <disk> -lvrn 10' (where <disk> is like 'xvdq'). I believe this
command shows bytes, not bits, thus '50M' is 50 MB or 400 Mb.
* What hardware are you using?
* Could you also run 'watch -n1 -d "nodetool tpstats"' during the node
restarts as well and see which threads are 'PENDING' during the restart. For
instance, if the flush writer is pending, the next write to that table has
to wait for the data to be flushed. It can be multiple things, but having
an interactive view of the pending requests might lead you to the root
cause of the issue.
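
Putting the hint-related suggestions above together, a test around one
restart could look roughly like this (a sketch; the values are examples only):

    nodetool disablehandoff          # run on all nodes before the test restart
    # ... drain + restart the node, watch 'dstat' and 'nodetool tpstats' ...
    nodetool enablehandoff           # re-enable once the node has settled
    # Less brutal alternative: throttle instead of disabling
    nodetool sethintedhandoffthrottlekb 512
    # cassandra.yaml knobs referenced above: hinted_handoff_enabled,
    # hinted_handoff_throttle_in_kb, max_hints_delivery_threads, phi_convict_threshold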

C*heers,
-----------------------
Alain Rodriguez - @arodream - alain@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Read timeouts when performing rolling restart

Posted by Riccardo Ferrari <fe...@gmail.com>.
Hi Shalom,

It happens almost at every restart, either a single node or a rolling one.
I do agree with you that it is good, at least on my setup, to wait a few
minutes to let the rebooted node cool down before moving to the next.
The more I look at it, the more I think it is something coming from hint
dispatching; maybe I should try something around hint throttling.
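
If it is hint dispatching, these are roughly the knobs to experiment with (a
sketch; the values are examples, not recommendations):

    # cassandra.yaml (picked up on restart):
    #   hinted_handoff_throttle_in_kb: 512    # default 1024
    #   max_hints_delivery_threads: 1         # default 2
    # or adjust the throttle at runtime on the nodes delivering hints:
    nodetool sethintedhandoffthrottlekb 512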

Thanks!


Re: Read timeouts when performing rolling restart

Posted by shalom sagges <sh...@gmail.com>.
Hi Riccardo,

Does this issue occur when performing a single restart or after several
restarts during a rolling restart (as mentioned in your original post)?
We have a cluster where, when performing a rolling restart, we prefer to wait
~10-15 minutes between each restart because we see an increase in GC for a
few minutes.
If we keep restarting the nodes quickly one after the other, the
applications experience timeouts (probably due to GC and hints).
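
One way to enforce that pause is to gate the next restart on the cluster
looking healthy again, for example (a sketch; the status check is an
assumption about the 'nodetool status' output format):

    # After restarting a node, wait until every node reports UN again:
    while nodetool status | awk '/^.[NLJM] / && $1 != "UN"' | grep -q .; do
        sleep 30
    done
    # then keep an eye on compactions and dropped messages before the next node:
    watch -n5 'nodetool compactionstats; nodetool tpstats | tail -n 12'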

Hope this helps!


Re: Read timeouts when performing rolling restart

Posted by Riccardo Ferrari <fe...@gmail.com>.
A little update on the progress.

First:
Thank you Thomas. I checked the code in the patch and briefly skimmed
through the 3.0.6 code. Yup it should be fixed.
Thank you Surbhi. At the moment we don't need authentication as the
instances are locked down.

Now:
- Unfortunately the start_native_transport trick does not always work: on
some nodes it works, on others it doesn't. What do I mean? I still experience
timeouts and dropped messages during startup.
- I realized that cutting the concurrent_compactors to 1 was not really a
good idea; the minimum value should be 2, currently testing 4 (that is
min(n_cores, n_disks); see the sketch below).
- After raising the compactors to 4 I still see some dropped messages for
HINT and MUTATIONS. This happens during startup. The reason is "for internal
timeout". Maybe too many compactors?

Thanks!


On Wed, Sep 12, 2018 at 7:09 PM, Surbhi Gupta <su...@gmail.com>
wrote:

> Another thing to notice is :
>
> system_auth WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': '1'}
>
> system_auth has a replication factor of 1 and even if one node is down it
> may impact the system because of the replication factor.
>
>
>
> On Wed, 12 Sep 2018 at 09:46, Steinmaurer, Thomas <
> thomas.steinmaurer@dynatrace.com> wrote:
>
>> Hi,
>>
>>
>>
>> I remember something that a client using the native protocol gets
>> notified too early by Cassandra being ready due to the following issue:
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-8236
>>
>>
>>
>> which looks similar, but above was marked as fixed in 2.2.
>>
>>
>>
>> Thomas
>>
>>
>>
>> *From:* Riccardo Ferrari <fe...@gmail.com>
>> *Sent:* Mittwoch, 12. September 2018 18:25
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: Read timeouts when performing rolling restart
>>
>>
>>
>> Hi Alain,
>>
>>
>>
>> Thank you for chiming in!
>>
>>
>>
>> I was thinking to perform the 'start_native_transport=false' test as well
>> and indeed the issue is not showing up. Starting the/a node with native
>> transport disabled and letting it cool down lead to no timeout exceptions
>> no dropped messages, simply a crystal clean startup. Agreed it is a
>> workaround
>>
>>
>>
>> # About upgrading:
>>
>> Yes, I desperately want to upgrade despite is a long and slow task. Just
>> reviewing all the changes from 3.0.6 to 3.0.17
>> is going to be a huge pain, top of your head, any breaking change I
>> should absolutely take care of reviewing ?
>>
>>
>>
>> # describecluster output: YES they agree on the same schema version
>>
>>
>>
>> # keyspaces:
>>
>> system WITH replication = {'class': 'LocalStrategy'}
>>
>> system_schema WITH replication = {'class': 'LocalStrategy'}
>>
>> system_auth WITH replication = {'class': 'SimpleStrategy',
>> 'replication_factor': '1'}
>>
>> system_distributed WITH replication = {'class': 'SimpleStrategy',
>> 'replication_factor': '3'}
>>
>> system_traces WITH replication = {'class': 'SimpleStrategy',
>> 'replication_factor': '2'}
>>
>>
>>
>> <custom1> WITH replication = {'class': 'SimpleStrategy',
>> 'replication_factor': '3'}
>>
>> <custom2>  WITH replication = {'class': 'SimpleStrategy',
>> 'replication_factor': '3'}
>>
>>
>>
>> # Snitch
>>
>> Ec2Snitch
>>
>>
>>
>> ## About Snitch and replication:
>>
>> - We have the default DC and all nodes are in the same RACK
>>
>> - We are planning to move to GossipingPropertyFileSnitch configuring the
>> cassandra-rackdc accortingly.
>>
>> -- This should be a transparent change, correct?
>>
>>
>>
>> - Once switched to GPFS, we plan to move to 'NetworkTopologyStrategy'
>> with 'us-xxxx' DC and replica counts as before
>>
>> - Then adding a new DC inside the VPC, but this is another story...
>>
>>
>>
>> Any concerns here ?
>>
>>
>>
>> # nodetool status <ks>
>>
>> --  Address         Load       Tokens       Owns (effective)  Host
>> ID                               Rack
>> UN  10.x.x.a  177 GB     256          50.3%
>> d8bfe4ad-8138-41fe-89a4-ee9a043095b5  rr
>> UN  10.x.x.b    152.46 GB  256          51.8%
>> 7888c077-346b-4e09-96b0-9f6376b8594f  rr
>> UN  10.x.x.c   159.59 GB  256          49.0%
>> 329b288e-c5b5-4b55-b75e-fbe9243e75fa  rr
>> UN  10.x.x.d  162.44 GB  256          49.3%
>> 07038c11-d200-46a0-9f6a-6e2465580fb1  rr
>> UN  10.x.x.e    174.9 GB   256          50.5%
>> c35b5d51-2d14-4334-9ffc-726f9dd8a214  rr
>> UN  10.x.x.f  194.71 GB  256          49.2%
>> f20f7a87-d5d2-4f38-a963-21e24167b8ac  rr
>>
>>
>>
>> # gossipinfo
>>
>> /10.x.x.a
>>   STATUS:827:NORMAL,-1350078789194251746
>>   LOAD:289986:1.90078037902E11
>>   SCHEMA:281088:af4461c3-d269-39bc-9d03-3566031c1e0a
>>   DC:6:<some-ec2-dc>
>>   RACK:8:rr
>>   RELEASE_VERSION:4:3.0.6
>>   SEVERITY:290040:0.5934718251228333
>>   NET_VERSION:1:10
>>   HOST_ID:2:d8bfe4ad-8138-41fe-89a4-ee9a043095b5
>>   RPC_READY:868:true
>>   TOKENS:826:<hidden>
>> /10.x.x.b
>>   STATUS:16:NORMAL,-1023229528754013265
>>   LOAD:7113:1.63730480619E11
>>   SCHEMA:10:af4461c3-d269-39bc-9d03-3566031c1e0a
>>   DC:6:<some-ec2-dc>
>>   RACK:8:rr
>>   RELEASE_VERSION:4:3.0.6
>>   SEVERITY:7274:0.5988024473190308
>>   NET_VERSION:1:10
>>   HOST_ID:2:7888c077-346b-4e09-96b0-9f6376b8594f
>>   TOKENS:15:<hidden>
>> /10.x.x.c
>>   STATUS:732:NORMAL,-1117172759238888547
>>   LOAD:245839:1.71409806942E11
>>   SCHEMA:237168:af4461c3-d269-39bc-9d03-3566031c1e0a
>>   DC:6:<some-ec2-dc>
>>   RACK:8:rr
>>   RELEASE_VERSION:4:3.0.6
>>   SEVERITY:245989:0.0
>>   NET_VERSION:1:10
>>   HOST_ID:2:329b288e-c5b5-4b55-b75e-fbe9243e75fa
>>   RPC_READY:763:true
>>   TOKENS:731:<hidden>
>> /10.x.x.d
>>   STATUS:14:NORMAL,-1004942496246544417
>>   LOAD:313125:1.74447964917E11
>>   SCHEMA:304268:af4461c3-d269-39bc-9d03-3566031c1e0a
>>   DC:6:<some-ec2-dc>
>>   RACK:8:rr
>>   RELEASE_VERSION:4:3.0.6
>>   SEVERITY:313215:0.25641027092933655
>>   NET_VERSION:1:10
>>   HOST_ID:2:07038c11-d200-46a0-9f6a-6e2465580fb1
>>   RPC_READY:56:true
>>   TOKENS:13:<hidden>
>> /10.x.x.e
>>   STATUS:520:NORMAL,-1058809960483771749
>>   LOAD:276118:1.87831573032E11
>>   SCHEMA:267327:af4461c3-d269-39bc-9d03-3566031c1e0a
>>   DC:6:<some-ec2-dc>
>>   RACK:8:rr
>>   RELEASE_VERSION:4:3.0.6
>>   SEVERITY:276217:0.32786884903907776
>>   NET_VERSION:1:10
>>   HOST_ID:2:c35b5d51-2d14-4334-9ffc-726f9dd8a214
>>   RPC_READY:550:true
>>   TOKENS:519:<hidden>
>> /10.x.x.f
>>   STATUS:1081:NORMAL,-1039671799603495012
>>   LOAD:239114:2.09082017545E11
>>   SCHEMA:230229:af4461c3-d269-39bc-9d03-3566031c1e0a
>>   DC:6:<some-ec2-dc>
>>   RACK:8:rr
>>   RELEASE_VERSION:4:3.0.6
>>   SEVERITY:239180:0.5665722489356995
>>   NET_VERSION:1:10
>>   HOST_ID:2:f20f7a87-d5d2-4f38-a963-21e24167b8ac
>>   RPC_READY:1118:true
>>   TOKENS:1080:<hidden>
>>
>>
>>
>> ## About load and tokens:
>>
>> - While load is pretty even this does not apply to tokens, I guess we
>> have some table with uneven distribution. This should not be the case for
>> high load tables as partition keys are built with some 'id + <some time
>> format>'
>>
>> - I was not able to find some documentation about the numbers printed
>> next to LOAD, SCHEMA, SEVERITY, RPC_READY ... Is there any doc around ?
>>
>>
>>
>> # Tombstones
>>
>> No ERRORS, only WARN about a very specific table that we are aware of. It
>> is an append only table read by spark from a batch job. (I guess it is a
>> read_repair chance or DTCS misconfig)
>>
>>
>>
>> ## Closing note!
>>
>> We are on old m1.xlarge instances (4 vcpu) with raid0 (stripe) over the 4
>> spinning drives; some changes to the cassandra.yaml:
>>
>>
>>
>> - dynamic_snitch: false
>>
>> - concurrent_reads: 48
>>
>> - concurrent_compactors: 1 (was 2)
>>
>> - disk_optimization_strategy: spinning
>>
>>
>>
>> I have some concerns about the number of concurrent_compactors, what do
>> you think?
>>
>>
>>
>> Thanks!
>>
>>
>>
>

Re: Read timeouts when performing rolling restart

Posted by Surbhi Gupta <su...@gmail.com>.
Another thing to notice is:

system_auth WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '1'}

system_auth has a replication factor of 1, so even a single node being down can
impact clients: the auth data whose token range lands on that node has no other
replica to serve it.
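A minimal sketch of how that could be raised to match the RF=3 scheme used by
the other keyspaces (assuming SimpleStrategy is kept for now; a repair of
system_auth on every node is needed afterwards so the new replicas actually get
the data):

  # bump system_auth to RF=3 (same scheme as the other keyspaces on this cluster)
  cqlsh -e "ALTER KEYSPACE system_auth WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};"
  # then, on each node, one at a time:
  nodetool repair system_auth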



On Wed, 12 Sep 2018 at 09:46, Steinmaurer, Thomas <
thomas.steinmaurer@dynatrace.com> wrote:

> Hi,
>
>
>
> I remember something about clients using the native protocol being notified
> too early that Cassandra is ready, due to the following issue:
>
> https://issues.apache.org/jira/browse/CASSANDRA-8236
>
>
>
> which looks similar, but above was marked as fixed in 2.2.
>
>
>
> Thomas
>
>
>
> *From:* Riccardo Ferrari <fe...@gmail.com>
> *Sent:* Mittwoch, 12. September 2018 18:25
> *To:* user@cassandra.apache.org
> *Subject:* Re: Read timeouts when performing rolling restart
>
>
>
> Hi Alain,
>
>
>
> Thank you for chiming in!
>
>
>
> I was thinking of performing the 'start_native_transport=false' test as well,
> and indeed the issue does not show up. Starting a node with native transport
> disabled and letting it cool down leads to no timeout exceptions and no
> dropped messages, simply a crystal-clean startup. Agreed, it is a workaround.
>
>
>
> # About upgrading:
>
> Yes, I desperately want to upgrade despite it being a long and slow task.
> Just reviewing all the changes from 3.0.6 to 3.0.17 is going to be a huge
> pain; off the top of your head, any breaking change I should absolutely take
> care of reviewing?
>
>
>
> # describecluster output: YES they agree on the same schema version
>
>
>
> # keyspaces:
>
> system WITH replication = {'class': 'LocalStrategy'}
>
> system_schema WITH replication = {'class': 'LocalStrategy'}
>
> system_auth WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': '1'}
>
> system_distributed WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': '3'}
>
> system_traces WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': '2'}
>
>
>
> <custom1> WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': '3'}
>
> <custom2>  WITH replication = {'class': 'SimpleStrategy',
> 'replication_factor': '3'}
>
>
>
> # Snitch
>
> Ec2Snitch
>
>
>
> ## About Snitch and replication:
>
> - We have the default DC and all nodes are in the same RACK
>
> - We are planning to move to GossipingPropertyFileSnitch configuring the
> cassandra-rackdc accordingly.
>
> -- This should be a transparent change, correct?
>
>
>
> - Once switched to GPFS, we plan to move to 'NetworkTopologyStrategy' with
> 'us-xxxx' DC and replica counts as before
>
> - Then adding a new DC inside the VPC, but this is another story...
>
>
>
> Any concerns here ?
>
>
>
> # nodetool status <ks>
>
> --  Address         Load       Tokens       Owns (effective)  Host
> ID                               Rack
> UN  10.x.x.a  177 GB     256          50.3%
> d8bfe4ad-8138-41fe-89a4-ee9a043095b5  rr
> UN  10.x.x.b    152.46 GB  256          51.8%
> 7888c077-346b-4e09-96b0-9f6376b8594f  rr
> UN  10.x.x.c   159.59 GB  256          49.0%
> 329b288e-c5b5-4b55-b75e-fbe9243e75fa  rr
> UN  10.x.x.d  162.44 GB  256          49.3%
> 07038c11-d200-46a0-9f6a-6e2465580fb1  rr
> UN  10.x.x.e    174.9 GB   256          50.5%
> c35b5d51-2d14-4334-9ffc-726f9dd8a214  rr
> UN  10.x.x.f  194.71 GB  256          49.2%
> f20f7a87-d5d2-4f38-a963-21e24167b8ac  rr
>
>
>
> # gossipinfo
>
> /10.x.x.a
>   STATUS:827:NORMAL,-1350078789194251746
>   LOAD:289986:1.90078037902E11
>   SCHEMA:281088:af4461c3-d269-39bc-9d03-3566031c1e0a
>   DC:6:<some-ec2-dc>
>   RACK:8:rr
>   RELEASE_VERSION:4:3.0.6
>   SEVERITY:290040:0.5934718251228333
>   NET_VERSION:1:10
>   HOST_ID:2:d8bfe4ad-8138-41fe-89a4-ee9a043095b5
>   RPC_READY:868:true
>   TOKENS:826:<hidden>
> /10.x.x.b
>   STATUS:16:NORMAL,-1023229528754013265
>   LOAD:7113:1.63730480619E11
>   SCHEMA:10:af4461c3-d269-39bc-9d03-3566031c1e0a
>   DC:6:<some-ec2-dc>
>   RACK:8:rr
>   RELEASE_VERSION:4:3.0.6
>   SEVERITY:7274:0.5988024473190308
>   NET_VERSION:1:10
>   HOST_ID:2:7888c077-346b-4e09-96b0-9f6376b8594f
>   TOKENS:15:<hidden>
> /10.x.x.c
>   STATUS:732:NORMAL,-1117172759238888547
>   LOAD:245839:1.71409806942E11
>   SCHEMA:237168:af4461c3-d269-39bc-9d03-3566031c1e0a
>   DC:6:<some-ec2-dc>
>   RACK:8:rr
>   RELEASE_VERSION:4:3.0.6
>   SEVERITY:245989:0.0
>   NET_VERSION:1:10
>   HOST_ID:2:329b288e-c5b5-4b55-b75e-fbe9243e75fa
>   RPC_READY:763:true
>   TOKENS:731:<hidden>
> /10.x.x.d
>   STATUS:14:NORMAL,-1004942496246544417
>   LOAD:313125:1.74447964917E11
>   SCHEMA:304268:af4461c3-d269-39bc-9d03-3566031c1e0a
>   DC:6:<some-ec2-dc>
>   RACK:8:rr
>   RELEASE_VERSION:4:3.0.6
>   SEVERITY:313215:0.25641027092933655
>   NET_VERSION:1:10
>   HOST_ID:2:07038c11-d200-46a0-9f6a-6e2465580fb1
>   RPC_READY:56:true
>   TOKENS:13:<hidden>
> /10.x.x.e
>   STATUS:520:NORMAL,-1058809960483771749
>   LOAD:276118:1.87831573032E11
>   SCHEMA:267327:af4461c3-d269-39bc-9d03-3566031c1e0a
>   DC:6:<some-ec2-dc>
>   RACK:8:rr
>   RELEASE_VERSION:4:3.0.6
>   SEVERITY:276217:0.32786884903907776
>   NET_VERSION:1:10
>   HOST_ID:2:c35b5d51-2d14-4334-9ffc-726f9dd8a214
>   RPC_READY:550:true
>   TOKENS:519:<hidden>
> /10.x.x.f
>   STATUS:1081:NORMAL,-1039671799603495012
>   LOAD:239114:2.09082017545E11
>   SCHEMA:230229:af4461c3-d269-39bc-9d03-3566031c1e0a
>   DC:6:<some-ec2-dc>
>   RACK:8:rr
>   RELEASE_VERSION:4:3.0.6
>   SEVERITY:239180:0.5665722489356995
>   NET_VERSION:1:10
>   HOST_ID:2:f20f7a87-d5d2-4f38-a963-21e24167b8ac
>   RPC_READY:1118:true
>   TOKENS:1080:<hidden>
>
>
>
> ## About load and tokens:
>
> - While load is pretty even this does not apply to tokens, I guess we have
> some table with uneven distribution. This should not be the case for high
> load tables as partition keys are built with some 'id + <some time
> format>'
>
> - I was not able to find some documentation about the numbers printed next
> to LOAD, SCHEMA, SEVERITY, RPC_READY ... Is there any doc around ?
>
>
>
> # Tombstones
>
> No ERRORS, only WARN about a very specific table that we are aware of. It
> is an append only table read by spark from a batch job. (I guess it is a
> read_repair chance or DTCS misconfig)
>
>
>
> ## Closing note!
>
> We are on old m1.xlarge instances (4 vcpu) with raid0 (stripe) over the 4
> spinning drives; some changes to the cassandra.yaml:
>
>
>
> - dynamic_snitch: false
>
> - concurrent_reads: 48
>
> - concurrent_compactors: 1 (was 2)
>
> - disk_optimization_strategy: spinning
>
>
>
> I have some concerns about the number of concurrent_compactors, what do
> you think?
>
>
>
> Thanks!
>
>
>

RE: Read timeouts when performing rolling restart

Posted by "Steinmaurer, Thomas" <th...@dynatrace.com>.
Hi,

I remember something about clients using the native protocol being notified too early that Cassandra is ready, due to the following issue:
https://issues.apache.org/jira/browse/CASSANDRA-8236

which looks similar, but above was marked as fixed in 2.2.

Thomas
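
P.S. A quick way to cross-check on the restarted node whether the native transport really is up by the time the drivers start failing (a rough sketch; both commands should exist on 3.0):

  # is the native transport actually accepting clients yet?
  nodetool statusbinary
  # gossip / native transport state as seen by the node itself
  nodetool info | grep -i -e gossip -e native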

From: Riccardo Ferrari <fe...@gmail.com>
Sent: Mittwoch, 12. September 2018 18:25
To: user@cassandra.apache.org
Subject: Re: Read timeouts when performing rolling restart

Hi Alain,

Thank you for chiming in!

I was thinking of performing the 'start_native_transport=false' test as well, and indeed the issue does not show up. Starting a node with native transport disabled and letting it cool down leads to no timeout exceptions and no dropped messages, simply a crystal-clean startup. Agreed, it is a workaround.

# About upgrading:
Yes, I desperately want to upgrade despite it being a long and slow task. Just reviewing all the changes from 3.0.6 to 3.0.17
is going to be a huge pain; off the top of your head, any breaking change I should absolutely take care of reviewing?

# describecluster output: YES they agree on the same schema version

# keyspaces:
system WITH replication = {'class': 'LocalStrategy'}
system_schema WITH replication = {'class': 'LocalStrategy'}
system_auth WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'}
system_distributed WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'}
system_traces WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '2'}

<custom1> WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'}
<custom2>  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'}

# Snitch
Ec2Snitch

## About Snitch and replication:
- We have the default DC and all nodes are in the same RACK
- We are planning to move to GossipingPropertyFileSnitch configuring the cassandra-rackdc accordingly.
-- This should be a transparent change, correct?

- Once switched to GPFS, we plan to move to 'NetworkTopologyStrategy' with 'us-xxxx' DC and replica counts as before
- Then adding a new DC inside the VPC, but this is another story...

Any concerns here ?

# nodetool status <ks>
--  Address         Load       Tokens       Owns (effective)  Host ID                               Rack
UN  10.x.x.a  177 GB     256          50.3%             d8bfe4ad-8138-41fe-89a4-ee9a043095b5  rr
UN  10.x.x.b    152.46 GB  256          51.8%             7888c077-346b-4e09-96b0-9f6376b8594f  rr
UN  10.x.x.c   159.59 GB  256          49.0%             329b288e-c5b5-4b55-b75e-fbe9243e75fa  rr
UN  10.x.x.d  162.44 GB  256          49.3%             07038c11-d200-46a0-9f6a-6e2465580fb1  rr
UN  10.x.x.e    174.9 GB   256          50.5%             c35b5d51-2d14-4334-9ffc-726f9dd8a214  rr
UN  10.x.x.f  194.71 GB  256          49.2%             f20f7a87-d5d2-4f38-a963-21e24167b8ac  rr

# gossipinfo
/10.x.x.a
  STATUS:827:NORMAL,-1350078789194251746
  LOAD:289986:1.90078037902E11
  SCHEMA:281088:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:290040:0.5934718251228333
  NET_VERSION:1:10
  HOST_ID:2:d8bfe4ad-8138-41fe-89a4-ee9a043095b5
  RPC_READY:868:true
  TOKENS:826:<hidden>
/10.x.x.b
  STATUS:16:NORMAL,-1023229528754013265
  LOAD:7113:1.63730480619E11
  SCHEMA:10:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:7274:0.5988024473190308
  NET_VERSION:1:10
  HOST_ID:2:7888c077-346b-4e09-96b0-9f6376b8594f
  TOKENS:15:<hidden>
/10.x.x.c
  STATUS:732:NORMAL,-1117172759238888547
  LOAD:245839:1.71409806942E11
  SCHEMA:237168:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:245989:0.0
  NET_VERSION:1:10
  HOST_ID:2:329b288e-c5b5-4b55-b75e-fbe9243e75fa
  RPC_READY:763:true
  TOKENS:731:<hidden>
/10.x.x.d
  STATUS:14:NORMAL,-1004942496246544417
  LOAD:313125:1.74447964917E11
  SCHEMA:304268:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:313215:0.25641027092933655
  NET_VERSION:1:10
  HOST_ID:2:07038c11-d200-46a0-9f6a-6e2465580fb1
  RPC_READY:56:true
  TOKENS:13:<hidden>
/10.x.x.e
  STATUS:520:NORMAL,-1058809960483771749
  LOAD:276118:1.87831573032E11
  SCHEMA:267327:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:276217:0.32786884903907776
  NET_VERSION:1:10
  HOST_ID:2:c35b5d51-2d14-4334-9ffc-726f9dd8a214
  RPC_READY:550:true
  TOKENS:519:<hidden>
/10.x.x.f
  STATUS:1081:NORMAL,-1039671799603495012
  LOAD:239114:2.09082017545E11
  SCHEMA:230229:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:239180:0.5665722489356995
  NET_VERSION:1:10
  HOST_ID:2:f20f7a87-d5d2-4f38-a963-21e24167b8ac
  RPC_READY:1118:true
  TOKENS:1080:<hidden>

## About load and tokens:
- While load is pretty even this does not apply to tokens, I guess we have some table with uneven distribution. This should not be the case for high load tables as partition keys are built with some 'id + <some time format>'
- I was not able to find some documentation about the numbers printed next to LOAD, SCHEMA, SEVERITY, RPC_READY ... Is there any doc around ?

# Tombstones
No ERRORS, only WARN about a very specific table that we are aware of. It is an append only table read by spark from a batch job. (I guess it is a read_repair chance or DTCS misconfig)

## Closing note!
We are on old m1.xlarge instances (4 vcpu) with raid0 (stripe) over the 4 spinning drives; some changes to the cassandra.yaml:

- dynamic_snitch: false
- concurrent_reads: 48
- concurrent_compactors: 1 (was 2)
- disk_optimization_strategy: spinning

I have some concerns about the number of concurrent_compactors, what do you think?

Thanks!


Re: Read timeouts when performing rolling restart

Posted by Riccardo Ferrari <fe...@gmail.com>.
Hi Alain,

Thank you for chiming in!

I was thinking of performing the 'start_native_transport=false' test as well,
and indeed the issue does not show up. Starting a node with native transport
disabled and letting it cool down leads to no timeout exceptions and no dropped
messages, simply a crystal-clean startup. Agreed, it is a workaround.

# About upgrading:
Yes, I desperately want to upgrade despite it being a long and slow task. Just
reviewing all the changes from 3.0.6 to 3.0.17 is going to be a huge pain; off
the top of your head, any breaking change I should absolutely take care of
reviewing?

# describecluster output: YES they agree on the same schema version

# keyspaces:
system WITH replication = {'class': 'LocalStrategy'}
system_schema WITH replication = {'class': 'LocalStrategy'}
system_auth WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '1'}
system_distributed WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '3'}
system_traces WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '2'}

<custom1> WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '3'}
<custom2>  WITH replication = {'class': 'SimpleStrategy',
'replication_factor': '3'}

# Snitch
Ec2Snitch

## About Snitch and replication:
- We have the default DC and all nodes are in the same RACK
- We are planning to move to GossipingPropertyFileSnitch configuring the
cassandra-rackdc accordingly.
-- This should be a transparent change, correct?

- Once switched to GPFS, we plan to move to 'NetworkTopologyStrategy' with
'us-xxxx' DC and replica counts as before
- Then adding a new DC inside the VPC, but this is another story...

Any concerns here ?
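
For reference, a rough sketch of what that migration could look like (the
DC/rack names below are the placeholders from the gossipinfo output further
down and must match exactly what Ec2Snitch reports today, otherwise replica
placement would change; file paths depend on the packaging):

  # per node: cassandra-rackdc.properties for GossipingPropertyFileSnitch,
  # reusing the DC/rack names Ec2Snitch already reports:
  #   dc=<some-ec2-dc>
  #   rack=rr
  # then set endpoint_snitch: GossipingPropertyFileSnitch in cassandra.yaml
  # and drain + restart, one node at a time.

  # once every node runs GPFS, switch keyspaces to NTS with the same effective
  # replica count, then repair:
  cqlsh -e "ALTER KEYSPACE <custom1> WITH replication = {'class': 'NetworkTopologyStrategy', '<some-ec2-dc>': 3};"
  nodetool repair <custom1>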

# nodetool status <ks>
--  Address         Load       Tokens       Owns (effective)  Host
ID                               Rack
UN  10.x.x.a  177 GB     256          50.3%
d8bfe4ad-8138-41fe-89a4-ee9a043095b5  rr
UN  10.x.x.b    152.46 GB  256          51.8%
7888c077-346b-4e09-96b0-9f6376b8594f  rr
UN  10.x.x.c   159.59 GB  256          49.0%
329b288e-c5b5-4b55-b75e-fbe9243e75fa  rr
UN  10.x.x.d  162.44 GB  256          49.3%
07038c11-d200-46a0-9f6a-6e2465580fb1  rr
UN  10.x.x.e    174.9 GB   256          50.5%
c35b5d51-2d14-4334-9ffc-726f9dd8a214  rr
UN  10.x.x.f  194.71 GB  256          49.2%
f20f7a87-d5d2-4f38-a963-21e24167b8ac  rr

# gossipinfo
/10.x.x.a
  STATUS:827:NORMAL,-1350078789194251746
  LOAD:289986:1.90078037902E11
  SCHEMA:281088:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:290040:0.5934718251228333
  NET_VERSION:1:10
  HOST_ID:2:d8bfe4ad-8138-41fe-89a4-ee9a043095b5
  RPC_READY:868:true
  TOKENS:826:<hidden>
/10.x.x.b
  STATUS:16:NORMAL,-1023229528754013265
  LOAD:7113:1.63730480619E11
  SCHEMA:10:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:7274:0.5988024473190308
  NET_VERSION:1:10
  HOST_ID:2:7888c077-346b-4e09-96b0-9f6376b8594f
  TOKENS:15:<hidden>
/10.x.x.c
  STATUS:732:NORMAL,-1117172759238888547
  LOAD:245839:1.71409806942E11
  SCHEMA:237168:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:245989:0.0
  NET_VERSION:1:10
  HOST_ID:2:329b288e-c5b5-4b55-b75e-fbe9243e75fa
  RPC_READY:763:true
  TOKENS:731:<hidden>
/10.x.x.d
  STATUS:14:NORMAL,-1004942496246544417
  LOAD:313125:1.74447964917E11
  SCHEMA:304268:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:313215:0.25641027092933655
  NET_VERSION:1:10
  HOST_ID:2:07038c11-d200-46a0-9f6a-6e2465580fb1
  RPC_READY:56:true
  TOKENS:13:<hidden>
/10.x.x.e
  STATUS:520:NORMAL,-1058809960483771749
  LOAD:276118:1.87831573032E11
  SCHEMA:267327:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:276217:0.32786884903907776
  NET_VERSION:1:10
  HOST_ID:2:c35b5d51-2d14-4334-9ffc-726f9dd8a214
  RPC_READY:550:true
  TOKENS:519:<hidden>
/10.x.x.f
  STATUS:1081:NORMAL,-1039671799603495012
  LOAD:239114:2.09082017545E11
  SCHEMA:230229:af4461c3-d269-39bc-9d03-3566031c1e0a
  DC:6:<some-ec2-dc>
  RACK:8:rr
  RELEASE_VERSION:4:3.0.6
  SEVERITY:239180:0.5665722489356995
  NET_VERSION:1:10
  HOST_ID:2:f20f7a87-d5d2-4f38-a963-21e24167b8ac
  RPC_READY:1118:true
  TOKENS:1080:<hidden>

## About load and tokens:
- While load is pretty even this does not apply to tokens, I guess we have
some table with uneven distribution. This should not be the case for high
load tables as partition keys are built with some 'id + <some time
format>'
- I was not able to find some documentation about the numbers printed next
to LOAD, SCHEMA, SEVERITY, RPC_READY ... Is there any doc around ?

# Tombstones
No ERRORS, only WARN about a very specific table that we are aware of. It
is an append only table read by spark from a batch job. (I guess it is a
read_repair chance or DTCS misconfig)
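
(If useful, per-node tombstone pressure on that table can be eyeballed with
something like the following; <ks>.<table> is a placeholder.)

  # compaction options and read_repair_chance of the suspect table
  cqlsh -e "DESCRIBE TABLE <ks>.<table>;"
  # tombstones actually scanned per read slice on this node
  nodetool tablestats <ks>.<table> | grep -i tombstone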

## Closing note!
We are on old m1.xlarge instances (4 vcpu) with raid0 (stripe) over the 4
spinning drives; some changes to the cassandra.yaml:

- dynamic_snitch: false
- concurrent_reads: 48
- concurrent_compactors: 1 (was 2)
- disk_optimization_strategy: spinning

I have some concerns about the number of concurrent_compactors, what do you
think?
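
(For context, these can also be checked and adjusted at runtime without a
restart; the value below is only an example, not a recommendation.)

  # current compaction backlog and throttle on this node
  nodetool compactionstats
  nodetool getcompactionthroughput
  # throttle compaction on the spinning disks if reads suffer during catch-up
  nodetool setcompactionthroughput 16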

Thanks!

Re: Read timeouts when performing rolling restart

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hello Riccardo

How come that a single node is impacting the whole cluster?
>

It sounds weird indeed.

Is there a way to further delay the native transport startup?


You can configure 'start_native_transport: false' in 'cassandra.yaml'. (
https://github.com/apache/cassandra/blob/cassandra-3.0.6/conf/cassandra.yaml#L496
)
Then 'nodetool enablebinary' (
http://cassandra.apache.org/doc/latest/tools/nodetool/enablebinary.html)
when you are ready for it.

But I would consider this a workaround, and it might not even work; I hope it
does though :).
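
To make it concrete, a rough sketch of the sequence (the settle checks and the
service command are assumptions, adapt them to your setup), one node at a time:

  nodetool drain
  # restart with start_native_transport: false set in cassandra.yaml
  sudo service cassandra restart
  # wait for hints/compactions to settle; keep an eye on the compaction
  # backlog and on the "Message type / Dropped" section of tpstats
  nodetool compactionstats
  nodetool tpstats
  # only then let clients back in:
  nodetool enablebinary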

Any hint on troubleshooting it further?
>

The version of Cassandra is quite an early 3.0 release. It's probably worth
considering a move to 3.0.17, if not to solve this issue, then to avoid other
issues that were fixed since then.
To know if that would really help you, you can go through
https://github.com/apache/cassandra/blob/cassandra-3.0.17/CHANGES.txt

I am not too sure about what is going on, but here are some other things I
would look at to try to understand this:

Are all the nodes agreeing on the schema?
'nodetool describecluster'

Are all the keyspaces using the 'NetworkTopologyStrategy' and a replication
factor of 2+?
'cqlsh -e "DESCRIBE KEYSPACES;" '

What snitch are you using (in cassandra.yaml)?

What does ownership look like?
'nodetool status <ks>'

What about gossip?
'nodetool gossipinfo' or 'nodetool gossipinfo | grep STATUS' maybe.

A tombstone issue?
https://support.datastax.com/hc/en-us/articles/204612559-ReadTimeoutException-seen-when-using-the-java-driver-caused-by-excessive-tombstones

Any ERROR or WARN in the logs after the restart on this node and on other
nodes (you would see the tombstone issue here)?
'grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log'

I hope one of those will help, let us know if you need help to interpret
some of the outputs,

C*heers,
-----------------------
Alain Rodriguez - @arodream - alain@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


On Wed, 12 Sep 2018 at 10:59, Riccardo Ferrari <fe...@gmail.com>
wrote:

> Hi list,
>
> We are seeing the following behaviour when performing a rolling restart:
>
> On the node I need to restart:
> *  I run the 'nodetool drain'
> * Then 'service cassandra restart'
>
> so far so good. The load increase on the other 5 nodes is negligible.
> The node is generally out of service just for the time of the restart (i.e. a
> cassandra.yaml update)
>
> When the node comes back up and switches on the native transport, I start
> seeing lots of read timeouts in our various services:
>
> com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra
> timeout during read query at consistency LOCAL_ONE (1 responses were
> required but only 0 replica responded)
>
> Indeed the restarting node has a huge peak in system load because of hints
> and compactions; nevertheless I don't notice a load increase on the
> other 5 nodes.
>
> Specs:
> 6 nodes cluster on Cassandra 3.0.6
> - keyspace RF=3
>
> Java driver 3.5.1:
> - DefaultRetryPolicy
> - default LoadBalancingPolicy (that should be DCAwareRoundRobinPolicy)
>
> QUESTIONS:
> How come that a single node is impacting the whole cluster?
> Is there a way to further delay the native transport startup?
> Any hint on troubleshooting it further?
>
> Thanks
>