Posted to user@cassandra.apache.org by Chuck Reynolds <cr...@ancestry.com> on 2017/08/30 15:50:33 UTC

system_auth replication factor in Cassandra 2.1

So I’ve read that if you’re using authentication in Cassandra 2.1, your replication factor should match the number of nodes in your datacenter.

Is that true?

I have a two-datacenter cluster: 135 nodes in datacenter 1 & 227 nodes in an AWS datacenter.

Why do I want to replicate the system_auth table that many times?

What are the benefits and disadvantages of matching the number of nodes as opposed to the standard replication factor of 3?


The reason I’m asking the question is because it seems like I’m getting a lot of authentication errors now and they seem to happen more under load.

Also, querying the system_auth table from cqlsh to get the users now seems to time out.


Any help would be greatly appreciated.

Thanks

Re: system_auth replication factor in Cassandra 2.1

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Wed, Aug 30, 2017 at 6:20 PM, Chuck Reynolds <cr...@ancestry.com>
wrote:

> So I tried to run a repair with the following on one of the servers.
>
> nodetool repair system_auth -pr -local
>
>
>
> After two hours it hadn’t finished.  I had to kill the repair because of
> another issue and haven’t tried again.
>
>
>
> Why would such a small table take so long to repair?
>

It could be the overhead of that many nodes having to communicate with each
other (times the number of vnodes).  Even on a small cluster (3-5 nodes) I
think it takes a few minutes to run a repair on a small/empty keyspace.

Also what would happen if I set the RF back to a lower number like 5?
>

You should still run a repair afterwards, but I would expect it to finish
in a reasonable time.

--
Alex

Re: system_auth replication factor in Cassandra 2.1

Posted by Chuck Reynolds <cr...@ancestry.com>.
So I tried to run a repair with the following on one of the servers.
nodetool repair system_auth -pr -local

After two hours it hadn’t finished.  I had to kill the repair because of another issue and haven’t tried again.

Why would such a small table take so long to repair?

Also what would happen if I set the RF back to a lower number like 5?


Thanks
From: <li...@beobal.com> on behalf of Sam Tunnicliffe <sa...@beobal.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Wednesday, August 30, 2017 at 10:10 AM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: system_auth replication factor in Cassandra 2.1

It's a better rule of thumb to use an RF of 3 to 5 per DC and this is what the docs now suggest: http://cassandra.apache.org/doc/latest/operating/security.html#authentication
Out of the box, the system_auth keyspace is set up with SimpleStrategy and RF=1 so that it works on any new system including dev & test clusters, but obviously that's no use for a production system.

Regarding the increased rate of authentication errors: did you run repair after changing the RF? Auth queries are done at CL.LOCAL_ONE, so if you haven't repaired, the data for the user logging in will probably not be where it should be. The exception to this is the default "cassandra" user; queries for that user are done at CL.QUORUM, which will indeed lead to timeouts and authentication errors with a very high RF. It's recommended to only use that default user to bootstrap the setup of your own users & superusers; the link above also has info on this.

Thanks,
Sam


On 30 August 2017 at 16:50, Chuck Reynolds <cr...@ancestry.com> wrote:
So I’ve read that if you’re using authentication in Cassandra 2.1, your replication factor should match the number of nodes in your datacenter.

Is that true?

I have a two-datacenter cluster: 135 nodes in datacenter 1 & 227 nodes in an AWS datacenter.

Why do I want to replicate the system_auth table that many times?

What are the benefits and disadvantages of matching the number of nodes as opposed to the standard replication factor of 3?


The reason I’m asking the question is because it seems like I’m getting a lot of authentication errors now and they seem to happen more under load.

Also, querying the system_auth table from cqlsh to get the users now seems to time out.


Any help would be greatly appreciated.

Thanks


Re: system_auth replication factor in Cassandra 2.1

Posted by Sam Tunnicliffe <sa...@beobal.com>.
It's a better rule of thumb to use an RF of 3 to 5 per DC and this is what
the docs now suggest:
http://cassandra.apache.org/doc/latest/operating/security.html#authentication

Out of the box, the system_auth keyspace is set up with SimpleStrategy and
RF=1 so that it works on any new system including dev & test clusters, but
obviously that's no use for a production system.
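
For reference, changing that for a production cluster is a plain CQL statement followed by a repair. A hedged sketch (the datacenter names and RF values below are illustrative; substitute the real DC names from "nodetool status"):

```sql
-- Illustrative only: use your actual DC names and a modest RF (3-5 per DC)
ALTER KEYSPACE system_auth WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'dc1': 5,
  'aws_dc': 5
};
```

After any RF change, run "nodetool repair system_auth" so the new replicas actually receive the auth data.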

Regarding the increased rate of authentication errors: did you run repair
after changing the RF? Auth queries are done at CL.LOCAL_ONE, so if you
haven't repaired, the data for the user logging in will probably not be
where it should be. The exception to this is the default "cassandra" user,
queries for that user are done at CL.QUORUM, which will indeed lead to
timeouts and authentication errors with a very high RF. It's recommended to
only use that default user to bootstrap the setup of your own users &
superusers; the link above also has info on this.
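
The arithmetic behind that QUORUM warning can be sketched with the node counts from this thread (this is just the quorum formula, not Cassandra source code):

```python
def quorum(rf_total: int) -> int:
    # Quorum size for a given total replication factor:
    # quorum = floor(rf_total / 2) + 1
    return rf_total // 2 + 1

# RF matched to node counts (135 + 227): a login as the default
# "cassandra" user must hear from this many replicas to succeed.
print(quorum(135 + 227))  # 182

# A modest RF of 5 per DC needs only 6 responses.
print(quorum(5 + 5))  # 6
```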

Thanks,
Sam


On 30 August 2017 at 16:50, Chuck Reynolds <cr...@ancestry.com> wrote:

> So I’ve read that if you’re using authentication in Cassandra 2.1, your
> replication factor should match the number of nodes in your datacenter.
>
>
>
> Is that true?
>
>
>
> I have a two-datacenter cluster: 135 nodes in datacenter 1 & 227 nodes in an
> AWS datacenter.
>
>
>
> Why do I want to replicate the system_auth table that many times?
>
>
>
> What are the benefits and disadvantages of matching the number of nodes
> as opposed to the standard replication factor of 3?
>
>
>
>
>
> The reason I’m asking the question is because it seems like I’m getting a
> lot of authentication errors now and they seem to happen more under load.
>
>
>
> Also, querying the system_auth table from cqlsh to get the users now seems
> to time out.
>
>
>
>
>
> Any help would be greatly appreciated.
>
>
>
> Thanks
>

Re: system_auth replication factor in Cassandra 2.1

Posted by Nate McCall <na...@thelastpickle.com>.
Regardless, if you are not modifying users frequently (with five you most
likely are not), make sure to turn the permission cache waaaayyy up.

In 2.1 that is just: permissions_validity_in_ms (default is 2000 or 2
seconds). Feel free to set it to 1 day or some such. The corresponding
async update parameter (permissions_update_interval_in_ms) can be set to a
slightly smaller value. If you really need to, you can drop the cache via
the "invalidate" operation on the
"org.apache.cassandra.auth:type=PermissionsCache" mbean (on each node) to
revoke a user for example.

In later versions, you would have to do the same with:
- roles_validity_in_ms
- credentials_validity_in_ms
and their corresponding 'interval' parameters.
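
In cassandra.yaml terms, the 2.1 settings Nate mentions might look like this (values are illustrative, not defaults; the async update interval sits slightly below the validity, per his suggestion):

```yaml
# Illustrative cassandra.yaml (2.1) values -- the defaults are 2000 ms
permissions_validity_in_ms: 86400000         # trust cached permissions for 24h
permissions_update_interval_in_ms: 82800000  # refresh in the background after 23h
```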

Re: system_auth replication factor in Cassandra 2.1

Posted by kurt greaves <ku...@instaclustr.com>.
For that many nodes mixed with vnodes you probably want a lower RF than N
per datacenter. 5 or 7 would be reasonable. The only downside is that auth
queries may take slightly longer as they will often have to go to other
nodes to be resolved, but in practice this is likely not a big deal as the
data will be cached anyway.

Re: system_auth replication factor in Cassandra 2.1

Posted by Erick Ramirez <fl...@gmail.com>.
It looks like nodes .113 and .116 have a problem. Repairing system_auth
which only contains 5 users should not take that long. Run with just nodetool
repair system_auth (without the -pr flag).

But first investigate why those 2 nodes are slow to respond. Cheers!
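
As a rough sequence on one of the suspect nodes (standard nodetool subcommands; the health checks are my suggestion, not Erick's exact steps):

```shell
# First, look for signs of a struggling node on .113 and .116
nodetool tpstats          # blocked/pending thread pools
nodetool compactionstats  # compaction backlog

# Then repair all replicas of the auth data (note: no -pr)
nodetool repair system_auth
```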

On Thu, Aug 31, 2017 at 3:00 AM, Chuck Reynolds <cr...@ancestry.com>
wrote:

> select * from users;
>
>
>
> OK here’s the trace.  The times are super long.
>
> name          | super
>
> ---------------+-------
>
>       user1 | False
>
>       user2 | True
>
>       user3 | True
>
>       user4 | False
>
>       user5 | True
>
> (5 rows)
>
> Tracing session: 55a4aa50-8da3-11e7-adbb-e7bbc3a8a72e
>
> activity | timestamp | source | source_elapsed
> ---------+-----------+--------+---------------
> Execute CQL3 query | 2017-08-30 10:50:31.413000 | xx.xx.xx.113 | 0
> READ message received from /xx.xx.xx.113 [MessagingService-Incoming-/xx.xx.xx.113] | 2017-08-30 10:50:31.398000 | xx.xx.xx.107 | 66
> Executing single-partition query on users [SharedPool-Worker-1] | 2017-08-30 10:50:31.399000 | xx.xx.xx.107 | 110
> Acquiring sstable references [SharedPool-Worker-1] | 2017-08-30 10:50:31.399000 | xx.xx.xx.107 | 114
> Merging memtable tombstones [SharedPool-Worker-1] | 2017-08-30 10:50:31.400000 | xx.xx.xx.107 | 121
> Key cache hit for sstable 288 [SharedPool-Worker-1] | 2017-08-30 10:50:31.400000 | xx.xx.xx.107 | 129
> Seeking to partition beginning in data file [SharedPool-Worker-1] | 2017-08-30 10:50:31.400000 | xx.xx.xx.107 | 130
> Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-1] | 2017-08-30 10:50:31.401000 | xx.xx.xx.107 | 209
> Merging data from memtables and 1 sstables [SharedPool-Worker-1] | 2017-08-30 10:50:31.401000 | xx.xx.xx.107 | 211
> Read 1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30 10:50:31.402000 | xx.xx.xx.107 | 226
> Enqueuing response to /xx.xx.xx.113 [SharedPool-Worker-1] | 2017-08-30 10:50:31.402000 | xx.xx.xx.107 | 321
> Sending REQUEST_RESPONSE message to /xx.xx.xx.113 [MessagingService-Outgoing-/xx.xx.xx.113] | 2017-08-30 10:50:31.402000 | xx.xx.xx.107 | 417
> Parsing select * from users; [SharedPool-Worker-2] | 2017-08-30 10:50:31.414000 | xx.xx.xx.113 | 22
> Preparing statement [SharedPool-Worker-2] | 2017-08-30 10:50:31.415000 | xx.xx.xx.113 | 58
> reading data from /xx.xx.xx.107 [SharedPool-Worker-2] | 2017-08-30 10:50:31.415000 | xx.xx.xx.113 | 950
> Sending READ message to /xx.xx.xx.107 [MessagingService-Outgoing-/xx.xx.xx.107] | 2017-08-30 10:50:31.415000 | xx.xx.xx.113 | 1017
> REQUEST_RESPONSE message received from /xx.xx.xx.107 [MessagingService-Incoming-/xx.xx.xx.107] | 2017-08-30 10:50:31.415000 | xx.xx.xx.113 | 1744
> Processing response from /xx.xx.xx.107 [SharedPool-Worker-1] | 2017-08-30 10:50:31.416000 | xx.xx.xx.113 | 1805
> Computing ranges to query [SharedPool-Worker-2] | 2017-08-30 10:50:31.416000 | xx.xx.xx.113 | 1853
> Submitting range requests on 63681 ranges with a concurrency of 10056 (0.009944752 rows per range expected) [SharedPool-Worker-2] | 2017-08-30 10:50:31.424000 | xx.xx.xx.113 | 11427
> PAGED_RANGE message received from /xx.xx.xx.113 [MessagingService-Incoming-/xx.xx.xx.113] | 2017-08-30 10:51:25.002000 | xx.xx.xx.116 | 28
> Executing seq scan across 1 sstables for [min(-9223372036854775808), min(-9223372036854775808)] [SharedPool-Worker-1] | 2017-08-30 10:51:25.002000 | xx.xx.xx.116 | 82
> Read 1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30 10:51:25.003000 | xx.xx.xx.116 | 178
> Read 1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30 10:51:25.003000 | xx.xx.xx.116 | 186
> Read 1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30 10:51:25.003000 | xx.xx.xx.116 | 191
> Read 1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30 10:51:25.003000 | xx.xx.xx.116 | 194
> Read 1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30 10:51:25.004000 | xx.xx.xx.116 | 198
> Scanned 5 rows and matched 5 [SharedPool-Worker-1] | 2017-08-30 10:51:25.004000 | xx.xx.xx.116 | 224
> Enqueuing response to /xx.xx.xx.113 [SharedPool-Worker-1] | 2017-08-30 10:51:25.004000 | xx.xx.xx.116 | 240
> Sending REQUEST_RESPONSE message to /xx.xx.xx.113 [MessagingService-Outgoing-/xx.xx.xx.113] | 2017-08-30 10:51:25.004000 | xx.xx.xx.116 | 302
> Enqueuing request to /xx.xx.xx.116 [SharedPool-Worker-2] | 2017-08-30 10:51:25.014000 | xx.xx.xx.113 | 601103
> Submitted 1 concurrent range requests covering 63681 ranges [SharedPool-Worker-2] | 2017-08-30 10:51:25.014000 | xx.xx.xx.113 | 601120
> Sending PAGED_RANGE message to /xx.xx.xx.116 [MessagingService-Outgoing-/xx.xx.xx.116] | 2017-08-30 10:51:25.015000 | xx.xx.xx.113 | 601190
> REQUEST_RESPONSE message received from /xx.xx.xx.116 [MessagingService-Incoming-/xx.xx.xx.116] | 2017-08-30 10:51:25.015000 | xx.xx.xx.113 | 601771
> Processing response from /xx.xx.xx.116 [SharedPool-Worker-1] | 2017-08-30 10:51:25.015000 | xx.xx.xx.113 | 601824
> Request complete | 2017-08-30 10:51:25.014874 | xx.xx.xx.113 | 601874
>
>
>
>
>
> From: Oleksandr Shulgin <ol...@zalando.de>
> Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
> Date: Wednesday, August 30, 2017 at 10:42 AM
> To: User <us...@cassandra.apache.org>
> Subject: Re: system_auth replication factor in Cassandra 2.1
>
>
>
> On Wed, Aug 30, 2017 at 6:40 PM, Chuck Reynolds <cr...@ancestry.com>
> wrote:
>
> How many users do you have (or expect to be found in system_auth.users)?
>
>   5 users.
>
> What are the current RF for system_auth and consistency level you are
> using in cqlsh?
>
>  135 in one DC and 227 in the other DC.  Consistency level one
>
>
>
> Still very surprising...
>
>
>
> Did you try to obtain a trace of a timing-out query (with TRACING ON)?
>
> Tracing timed out even though I increased it to 120 seconds.
>
>
>
> Even if cqlsh doesn't print the trace because of timeout, you should be
> still able to find something in system_traces.
>
>
>
> --
>
> Alex
>
>
>

Re: system_auth replication factor in Cassandra 2.1

Posted by Chuck Reynolds <cr...@ancestry.com>.
select * from users;

OK here’s the trace.  The times are super long.
name          | super
---------------+-------
      user1 | False
      user2 | True
      user3 | True
      user4 | False
      user5 | True
(5 rows)
Tracing session: 55a4aa50-8da3-11e7-adbb-e7bbc3a8a72e
activity                                                                                                                          | timestamp                  | source        | source_elapsed
-----------------------------------------------------------------------------------------------------------------------------------+----------------------------+---------------+----------------
                                                                                                                Execute CQL3 query | 2017-08-30 10:50:31.413000 | xx.xx.xx.113 |           0
                                              READ message received from /xx.xx.xx.113 [MessagingService-Incoming-/xx.xx.xx.113] | 2017-08-30 10:50:31.398000 | xx.xx.xx.107 |             66
                                                                   Executing single-partition query on users [SharedPool-Worker-1] | 2017-08-30 10:50:31.399000 | xx.xx.xx.107 |            110
                                                                                Acquiring sstable references [SharedPool-Worker-1] | 2017-08-30 10:50:31.399000 | xx.xx.xx.107 |            114
                                                                                 Merging memtable tombstones [SharedPool-Worker-1] | 2017-08-30 10:50:31.400000 | xx.xx.xx.107 |            121
                                                                               Key cache hit for sstable 288 [SharedPool-Worker-1] | 2017-08-30 10:50:31.400000 | xx.xx.xx.107 |            129
                                                                 Seeking to partition beginning in data file [SharedPool-Worker-1] | 2017-08-30 10:50:31.400000 | xx.xx.xx.107 |            130
                                   Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-1] | 2017-08-30 10:50:31.401000 | xx.xx.xx.107 |            209
                                                                  Merging data from memtables and 1 sstables [SharedPool-Worker-1] | 2017-08-30 10:50:31.401000 | xx.xx.xx.107 |            211
                                                                           Read 1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30 10:50:31.402000 | xx.xx.xx.107 |            226
                                                                        Enqueuing response to /xx.xx.xx.113 [SharedPool-Worker-1] | 2017-08-30 10:50:31.402000 | xx.xx.xx.107 |            321
                                     Sending REQUEST_RESPONSE message to /xx.xx.xx.113 [MessagingService-Outgoing-/xx.xx.xx.113] | 2017-08-30 10:50:31.402000 | xx.xx.xx.107 |            417
                                                                                Parsing select * from users; [SharedPool-Worker-2] | 2017-08-30 10:50:31.414000 | xx.xx.xx.113 |             22
                                                                                         Preparing statement [SharedPool-Worker-2] | 2017-08-30 10:50:31.415000 | xx.xx.xx.113 |             58
                                                                            reading data from /xx.xx.xx.107 [SharedPool-Worker-2] | 2017-08-30 10:50:31.415000 | xx.xx.xx.113 |            950
                                                 Sending READ message to /xx.xx.xx.107 [MessagingService-Outgoing-/xx.xx.xx.107] | 2017-08-30 10:50:31.415000 | xx.xx.xx.113 |           1017
                                  REQUEST_RESPONSE message received from /xx.xx.xx.107 [MessagingService-Incoming-/xx.xx.xx.107] | 2017-08-30 10:50:31.415000 | xx.xx.xx.113 |           1744
                                                                     Processing response from /xx.xx.xx.107 [SharedPool-Worker-1] | 2017-08-30 10:50:31.416000 | xx.xx.xx.113 |           1805
                                                                                   Computing ranges to query [SharedPool-Worker-2] | 2017-08-30 10:50:31.416000 | xx.xx.xx.113 |           1853
Submitting range requests on 63681 ranges with a concurrency of 10056 (0.009944752 rows per range expected) [SharedPool-Worker-2] | 2017-08-30 10:50:31.424000 | xx.xx.xx.113 |          11427
                                       PAGED_RANGE message received from /xx.xx.xx.113 [MessagingService-Incoming-/xx.xx.xx.113] | 2017-08-30 10:51:25.002000 | xx.xx.xx.116 |             28
             Executing seq scan across 1 sstables for [min(-9223372036854775808), min(-9223372036854775808)] [SharedPool-Worker-1] | 2017-08-30 10:51:25.002000 | xx.xx.xx.116 |             82
                                                                           Read 1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30 10:51:25.003000 | xx.xx.xx.116 |            178
                                                                           Read 1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30 10:51:25.003000 | xx.xx.xx.116 |            186
                                                                           Read 1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30 10:51:25.003000 | xx.xx.xx.116 |            191
                                                                           Read 1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30 10:51:25.003000 | xx.xx.xx.116 |            194
                                                                           Read 1 live and 0 tombstone cells [SharedPool-Worker-1] | 2017-08-30 10:51:25.004000 | xx.xx.xx.116 |            198
                                                                                Scanned 5 rows and matched 5 [SharedPool-Worker-1] | 2017-08-30 10:51:25.004000 | xx.xx.xx.116 |            224
                                                                        Enqueuing response to /xx.xx.xx.113 [SharedPool-Worker-1] | 2017-08-30 10:51:25.004000 | xx.xx.xx.116 |            240
                                     Sending REQUEST_RESPONSE message to /xx.xx.xx.113 [MessagingService-Outgoing-/xx.xx.xx.113] | 2017-08-30 10:51:25.004000 | xx.xx.xx.116 |            302
                                                                         Enqueuing request to /xx.xx.xx.116 [SharedPool-Worker-2] | 2017-08-30 10:51:25.014000 | xx.xx.xx.113 |         601103
                                                 Submitted 1 concurrent range requests covering 63681 ranges [SharedPool-Worker-2] | 2017-08-30 10:51:25.014000 | xx.xx.xx.113 |         601120
                                          Sending PAGED_RANGE message to /xx.xx.xx.116 [MessagingService-Outgoing-/xx.xx.xx.116] | 2017-08-30 10:51:25.015000 | xx.xx.xx.113 |         601190
                                  REQUEST_RESPONSE message received from /xx.xx.xx.116 [MessagingService-Incoming-/xx.xx.xx.116] | 2017-08-30 10:51:25.015000 | xx.xx.xx.113 |         601771
                                                                     Processing response from /xx.xx.xx.116 [SharedPool-Worker-1] | 2017-08-30 10:51:25.015000 | xx.xx.xx.113 |         601824
                                                                                                                  Request complete | 2017-08-30 10:51:25.014874 | xx.xx.xx.113 |         601874


From: Oleksandr Shulgin <ol...@zalando.de>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Wednesday, August 30, 2017 at 10:42 AM
To: User <us...@cassandra.apache.org>
Subject: Re: system_auth replication factor in Cassandra 2.1

On Wed, Aug 30, 2017 at 6:40 PM, Chuck Reynolds <cr...@ancestry.com> wrote:
How many users do you have (or expect to be found in system_auth.users)?
  5 users.
What are the current RF for system_auth and consistency level you are using in cqlsh?
 135 in one DC and 227 in the other DC.  Consistency level one

Still very surprising...

Did you try to obtain a trace of a timing-out query (with TRACING ON)?
Tracing timed out even though I increased it to 120 seconds.

Even if cqlsh doesn't print the trace because of timeout, you should be still able to find something in system_traces.

--
Alex


Re: system_auth replication factor in Cassandra 2.1

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Wed, Aug 30, 2017 at 6:40 PM, Chuck Reynolds <cr...@ancestry.com>
wrote:

> How many users do you have (or expect to be found in system_auth.users)?
>
>   5 users.
>
> What are the current RF for system_auth and consistency level you are
> using in cqlsh?
>
>  135 in one DC and 227 in the other DC.  Consistency level one
>

Still very surprising...

Did you try to obtain a trace of a timing-out query (with TRACING ON)?
>
> Tracing timed out even though I increased it to 120 seconds.
>

Even if cqlsh doesn't print the trace because of timeout, you should be
still able to find something in system_traces.
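
For example, the trace can be pulled by hand from the trace tables (the session id below is the one quoted elsewhere in this thread; substitute your own):

```sql
SELECT session_id, started_at, duration, request
FROM system_traces.sessions LIMIT 10;

SELECT activity, source, source_elapsed
FROM system_traces.events
WHERE session_id = 55a4aa50-8da3-11e7-adbb-e7bbc3a8a72e;
```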

--
Alex

Re: system_auth replication factor in Cassandra 2.1

Posted by Chuck Reynolds <cr...@ancestry.com>.
How many users do you have (or expect to be found in system_auth.users)?
  5 users.
What are the current RF for system_auth and consistency level you are using in cqlsh?
 135 in one DC and 227 in the other DC.  Consistency level one
Did you try to obtain a trace of a timing-out query (with TRACING ON)?
Tracing timed out even though I increased it to 120 seconds.

From: Oleksandr Shulgin <ol...@zalando.de>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Wednesday, August 30, 2017 at 10:19 AM
To: User <us...@cassandra.apache.org>
Subject: Re: system_auth replication factor in Cassandra 2.1

On Wed, Aug 30, 2017 at 5:50 PM, Chuck Reynolds <cr...@ancestry.com> wrote:
So I’ve read that if you’re using authentication in Cassandra 2.1, your replication factor should match the number of nodes in your datacenter.

Is that true?

I have a two-datacenter cluster: 135 nodes in datacenter 1 & 227 nodes in an AWS datacenter.

Why do I want to replicate the system_auth table that many times?

What are the benefits and disadvantages of matching the number of nodes as opposed to the standard replication factor of 3?


The reason I’m asking the question is because it seems like I’m getting a lot of authentication errors now and they seem to happen more under load.

Also, querying the system_auth table from cqlsh to get the users now seems to time out.

This is surprising.

How many users do you have (or expect to be found in system_auth.users)?   What are the current RF for system_auth and consistency level you are using in cqlsh?  Did you try to obtain a trace of a timing-out query (with TRACING ON)?

Regards,
--
Oleksandr "Alex" Shulgin | Database Engineer | Zalando SE | Tel: +49 176 127-59-707


Re: system_auth replication factor in Cassandra 2.1

Posted by Oleksandr Shulgin <ol...@zalando.de>.
On Wed, Aug 30, 2017 at 5:50 PM, Chuck Reynolds <cr...@ancestry.com>
wrote:

> So I’ve read that if you’re using authentication in Cassandra 2.1, your
> replication factor should match the number of nodes in your datacenter.
>
>
>
> Is that true?
>
>
>
> I have a two-datacenter cluster: 135 nodes in datacenter 1 & 227 nodes in an
> AWS datacenter.
>
>
>
> Why do I want to replicate the system_auth table that many times?
>
>
>
> What are the benefits and disadvantages of matching the number of nodes
> as opposed to the standard replication factor of 3?
>
>
>
>
>
> The reason I’m asking the question is because it seems like I’m getting a
> lot of authentication errors now and they seem to happen more under load.
>
>
>
> Also, querying the system_auth table from cqlsh to get the users now seems
> to time out.
>

This is surprising.

How many users do you have (or expect to be found in system_auth.users)?
What are the current RF for system_auth and consistency level you are using
in cqlsh?  Did you try to obtain a trace of a timing-out query (with
TRACING ON)?

Regards,
-- 
Oleksandr "Alex" Shulgin | Database Engineer | Zalando SE | Tel: +49 176
127-59-707

RE: system_auth replication factor in Cassandra 2.1

Posted by Jonathan Baynes <Jo...@tradeweb.com>.
I recently came across an issue whereby my user keyspace was replicated 3 times (I have 3 nodes) but my system_auth defaulted to RF=1. We also use authentication, and when I then lost 2 of my nodes, because authentication wasn’t replicated I couldn’t log in.

Once I resolved the issue and got the nodes back up, I could log back in. I too asked the community what was going on, and I was pointed to this:

http://docs.datastax.com/en/datastax_enterprise/4.8/datastax_enterprise/sec/secConfSysAuthKeyspRepl.html

It clearly states the following:

Attention: To prevent a potential problem logging into a secure cluster, set the replication factor of the system_auth and dse_security keyspaces to a value that is greater than 1. In a multi-node cluster, using the default of 1 prevents logging into any node when the node that stores the user data is down.



From: Chuck Reynolds [mailto:creynolds@ancestry.com]
Sent: 30 August 2017 16:51
To: user@cassandra.apache.org
Subject: system_auth replication factor in Cassandra 2.1

So I’ve read that if you’re using authentication in Cassandra 2.1, your replication factor should match the number of nodes in your datacenter.

Is that true?

I have a two-datacenter cluster: 135 nodes in datacenter 1 & 227 nodes in an AWS datacenter.

Why do I want to replicate the system_auth table that many times?

What are the benefits and disadvantages of matching the number of nodes as opposed to the standard replication factor of 3?


The reason I’m asking the question is because it seems like I’m getting a lot of authentication errors now and they seem to happen more under load.

Also, querying the system_auth table from cqlsh to get the users now seems to time out.


Any help would be greatly appreciated.

Thanks

________________________________________________________________________

This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy it. Any unauthorized copying, disclosure or distribution of the material in this e-mail is strictly forbidden. Tradeweb reserves the right to monitor all e-mail communications through its networks. If you do not wish to receive marketing emails about our products / services, please let us know by contacting us, either by email at contactus@tradeweb.com or by writing to us at the registered office of Tradeweb in the UK, which is: Tradeweb Europe Limited (company number 3912826), 1 Fore Street Avenue London EC2Y 9DT. To see our privacy policy, visit our website @ www.tradeweb.com.