Posted to user@cassandra.apache.org by Clint Kelly <cl...@gmail.com> on 2014/03/31 01:51:10 UTC

Meaning of "token" column in system.peers and system.local

Hi all,


I am working on a Hadoop InputFormat implementation that uses only the
native protocol Java driver and not the Thrift API.  I am currently trying
to replicate some of the behavior of
*Cassandra.client.describe_ring(myKeyspace)* from the Thrift API.  I would
like to do the following:

   - Get a list of all of the token ranges for a cluster
   - For every token range, determine the replica nodes on which the data
   in the token range resides
   - Estimate the number of rows for every range of tokens
   - Group ranges of tokens on common replica nodes such that we can
   create a set of input splits for Hadoop with total estimated row counts
   that are reasonably close to the requested split size (see the sketch
   just after this list)
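
To make that last bullet concrete, here is a rough sketch of the grouping I
have in mind. The RangeInfo class and the row estimates are hypothetical
placeholders; getting the replica sets without Thrift is exactly what I am
asking about below.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical holder for one token range, its replicas, and a row estimate. */
class RangeInfo {
    final String start;          // start token (exclusive)
    final String end;            // end token (inclusive)
    final Set<String> replicas;  // addresses of the nodes holding this range
    final long estimatedRows;    // rough row count for this range

    RangeInfo(String start, String end, Set<String> replicas, long estimatedRows) {
        this.start = start;
        this.end = end;
        this.replicas = replicas;
        this.estimatedRows = estimatedRows;
    }
}

class SplitGrouper {
    /** Bucket ranges by replica set, then cut each bucket into splits of roughly targetRows rows. */
    static List<List<RangeInfo>> group(List<RangeInfo> ranges, long targetRows) {
        // 1. Group ranges that share the same (unordered) replica set.
        Map<Set<String>, List<RangeInfo>> byReplicas = new HashMap<Set<String>, List<RangeInfo>>();
        for (RangeInfo r : ranges) {
            List<RangeInfo> bucket = byReplicas.get(r.replicas);
            if (bucket == null) {
                bucket = new ArrayList<RangeInfo>();
                byReplicas.put(r.replicas, bucket);
            }
            bucket.add(r);
        }

        // 2. Within each bucket, accumulate ranges until the estimated size is reached.
        List<List<RangeInfo>> splits = new ArrayList<List<RangeInfo>>();
        for (List<RangeInfo> bucket : byReplicas.values()) {
            List<RangeInfo> current = new ArrayList<RangeInfo>();
            long rows = 0;
            for (RangeInfo r : bucket) {
                current.add(r);
                rows += r.estimatedRows;
                if (rows >= targetRows) {
                    splits.add(current);
                    current = new ArrayList<RangeInfo>();
                    rows = 0;
                }
            }
            if (!current.isEmpty())
                splits.add(current);
        }
        return splits;
    }
}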

Last week I received some much-appreciated help on this list that pointed
me to using the system.peers table to get the list of token ranges for the
cluster and the corresponding hosts.  Today I created a three-node C*
cluster in Vagrant (https://github.com/dholbrook/vagrant-cassandra) and
tried inspecting some of the system tables.  I have a couple of questions
now:

1. *How many total unique tokens should I expect to see in my cluster?*  If
I have three nodes, and each node has a cassandra.yaml with num_tokens =
256, then should I expect a total of 256*3 = 768 distinct vnodes?

2. *How does the creation of vnodes and their assignment to nodes relate to
the replication factor for a given keyspace?*  I never thought about this
until today, and I tried to reread the documentation on virtual nodes,
replication in Cassandra, etc., and now I am sadly still confused.  Here is
what I think I understand.  :)

   - Given a row with a partition key, any client request for an operation
   on that row will go to a coordinator node in the cluster.
   - The coordinator node will compute the token value for the row and from
   that determine a set of replica nodes for that token.
      - One of the replica nodes I assume is the node that "owns" the vnode
      with the token range that encompasses the token
      - The identity of the "owner" of this virtual node is a
      cross-keyspace property
      - And the other replicas were originally chosen based on the
      replica-placement strategy
      - And therefore the other replicas will be different for each
      keyspace (because replication factors and replica-placement strategy are
      properties of a keyspace)

3. What do the values in the "token" column in system.peers and
system.local refer to then?

   - Since these tables appear to be global, and not per-keyspace
   properties, I assume that they don't have any information about replication
   in them, is that correct?
   - If I have three nodes in my cluster, 256 vnodes per node, and I'm
   using the Murmur3 partitioner, should I then expect to see the values of
   "tokens" in system.peers and system.local be 768 evenly-distributed values
   between -2^63 and 2^63?

4. Is there any other way, without using Thrift, to get as much information
as possible about what nodes contain replicas of data for all of the token
ranges in a given cluster?

I really appreciate any help, thanks!

Best regards,
Clint

Re: Meaning of "token" column in system.peers and system.local

Posted by Clint Kelly <cl...@gmail.com>.
Hi Theo,

Thanks for your response.  I understand what you are saying with
regard to the load balancing.  I posted my question to the DataStax
list and one of the folks there answered it.  I put his response below
(for anyone who may be curious):

Sylvain Lebresne <sylvain@datastax.com> wrote:
The system tables are a bit specific in the sense that they are local
to the node that coordinates the query. By default the java driver
round-robins queries over the nodes of the cluster. The result is
that, more likely than not, your two system queries (on system.local
and system.peers) do not reach the same coordinator, hence what you
see.

It's possible to enforce that both queries go to the same coordinator
by providing a custom load balancing policy. You could, for instance,
write a wrapper Statement class that allows you to specify which node
should be contacted, and then write a custom load balancing policy
that recognizes this wrapper class and forces the user-provided host
if there is one (and, say, falls back on another load balancing policy
otherwise). Or, simpler but somewhat less flexible, if all you want is
to have two requests go to the same coordinator (which is enough to
get all tokens of a cluster, really), then you can make sure to use
TokenAwarePolicy (a good idea anyway) and make sure both queries have
the same "routing key" (whatever it is is not all that important; you
can use an empty ByteBuffer); see SimpleStatement.setRoutingKey().

Note that I would agree that what's suggested above is slightly
involved and could be supported more "natively" by the driver. I do
plan on exposing the cluster tokens more simply (probably directly
from the Host object); it's just a todo not yet done. And I'll
probably add the load balancing stuff plus the Statement wrapper I
describe above, because that's generally useful, for debugging for
instance. Still, it's possible to do currently, just a bit more
involved than is probably necessary.

--
Sylvain
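
For anyone who wants to try Sylvain's routing-key suggestion, here is a
rough, untested sketch of what it might look like with the 2.0.x Java
driver. The empty routing key is arbitrary; it only needs to be the same on
both statements so that TokenAwarePolicy picks the same coordinator for
both queries.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.policies.RoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

import java.nio.ByteBuffer;

public class ClusterTokens {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("192.168.200.11")
                .withLoadBalancingPolicy(new TokenAwarePolicy(new RoundRobinPolicy()))
                .build();
        Session session = cluster.connect();

        // The same (empty) routing key on both statements means the
        // token-aware policy should route both to the same coordinator.
        ByteBuffer sameKey = ByteBuffer.allocate(0);

        SimpleStatement local = new SimpleStatement("SELECT tokens FROM system.local");
        local.setRoutingKey(sameKey);

        SimpleStatement peers = new SimpleStatement("SELECT peer, tokens FROM system.peers");
        peers.setRoutingKey(sameKey);

        // The coordinator's own tokens...
        for (Row row : session.execute(local)) {
            System.out.println("local: " + row.getSet("tokens", String.class));
        }
        // ...plus the tokens of every other node, as seen by that same coordinator.
        for (Row row : session.execute(peers)) {
            System.out.println(row.getInet("peer") + ": " + row.getSet("tokens", String.class));
        }

        cluster.shutdown();
    }
}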

On Mon, Mar 31, 2014 at 3:30 AM, Theo Hultberg <th...@iconara.net> wrote:
> your assumption about 256 tokens per node is correct.
>
> as for your second question, it seems to me like most of your assumptions are
> correct, but I'm not sure I understand them correctly. hopefully someone
> else can answer this better. tokens are a property of the cluster and not
> the keyspace. the first replica of any token will be the same for all
> keyspaces, but with different replication factors the other replicas will
> differ.
>
> when you query the system.local and system.peers tables you must make sure
> that you don't connect to other nodes. I think the inconsistency you think
> you found is because the first and second queries went to different nodes.
> the java driver will connect to all nodes and load balance requests by
> default.
>
> T#
>
>
> On Mon, Mar 31, 2014 at 4:06 AM, Clint Kelly <cl...@gmail.com> wrote:
>>
>> BTW one other thing that I have not been able to debug today that maybe
>> someone can help me with:
>>
>> I am using a three-node Cassandra cluster with Vagrant.  The nodes in my
>> cluster are 192.168.200.11, 192.168.200.12, and 192.168.200.13.
>>
>> If I use cqlsh to connect to 192.168.200.11, I see unique sets of tokens
>> when I run the following three commands:
>>
>> select tokens from system.local;
>> select tokens from system.peers where peer = '192.168.200.12';
>> select tokens from system.peers where peer = '192.168.200.13';
>>
>> This is what I expect.  However, when I tried making an application with
>> the Java driver that does the following:
>>
>> Create a Session by connecting to 192.168.200.11
>> From that session, "select tokens from system.local"
>> From that session, "select tokens, peer from system.peers"
>>
>> Now I get the exact-same set of tokens from system.local and from the row
>> in system.peers in which peer=192.168.200.13.
>>
>> Anyone have any idea why this would happen?  I'm not sure how to debug
>> this.  I see the following log from the Java driver:
>>
>> 14/03/30 19:05:24 DEBUG com.datastax.driver.core.Cluster: Starting new
>> cluster with contact points [/192.168.200.11]
>> 14/03/30 19:05:24 INFO com.datastax.driver.core.Cluster: New Cassandra
>> host /192.168.200.13 added
>> 14/03/30 19:05:24 INFO com.datastax.driver.core.Cluster: New Cassandra
>> host /192.168.200.12 added
>>
>> I'm running Cassandra 2.0.6 in the virtual machine and I built my
>> application with version 2.0.1 of the driver.
>>
>> Best regards,
>> Clint
>>
>>
>>
>>
>>
>>
>>
>> On Sun, Mar 30, 2014 at 4:51 PM, Clint Kelly <cl...@gmail.com>
>> wrote:
>>>
>>> Hi all,
>>>
>>>
>>> I am working on a Hadoop InputFormat implementation that uses only the
>>> native protocol Java driver and not the Thrift API.  I am currently trying
>>> to replicate some of the behavior of
>>> Cassandra.client.describe_ring(myKeyspace) from the Thrift API.  I would
>>> like to do the following:
>>>
>>> Get a list of all of the token ranges for a cluster
>>> For every token range, determine the replica nodes on which the data in
>>> the token range resides
>>> Estimate the number of rows for every range of tokens
>>> Group ranges of tokens on common replica nodes such that we can create a
>>> set of input splits for Hadoop with total estimated line counts that are
>>> reasonably close to the requested split size
>>>
>>> Last week I received some much-appreciated help on this list that pointed
>>> me to using the system.peers table to get the list of token ranges for the
>>> cluster and the corresponding hosts.  Today I created a three-node C*
>>> cluster in Vagrant (https://github.com/dholbrook/vagrant-cassandra) and
>>> tried inspecting some of the system tables.  I have a couple of questions
>>> now:
>>>
>>> 1. How many total unique tokens should I expect to see in my cluster?  If
>>> I have three nodes, and each node has a cassandra.yaml with num_tokens =
>>> 256, then should I expect a total of 256*3 = 768 distinct vnodes?
>>>
>>> 2. How does the creation of vnodes and their assignment to nodes relate
>>> to the replication factor for a given keyspace?  I never thought about this
>>> until today, and I tried to reread the documentation on virtual nodes,
>>> replication in Cassandra, etc., and now I am sadly still confused.  Here is
>>> what I think I understand.  :)
>>>
>>> Given a row with a partition key, any client request for an operation on
>>> that row will go to a coordinator node in the cluster.
>>> The coordinator node will compute the token value for the row and from
>>> that determine a set of replica nodes for that token.
>>>
>>> One of the replica nodes I assume is the node that "owns" the vnode with
>>> the token range that encompasses the token
>>> The identity of the "owner" of this virtual node is a cross-keyspace
>>> property
>>> And the other replicas were originally chosen based on the
>>> replica-placement strategy
>>> And therefore the other replicas will be different for each keyspace
>>> (because replication factors and replica-placement strategy are properties
>>> of a keyspace)
>>>
>>> 3. What do the values in the "token" column in system.peers and
>>> system.local refer to then?
>>>
>>> Since these tables appear to be global, and not per-keyspace properties,
>>> I assume that they don't have any information about replication in them, is
>>> that correct?
>>> If I have three nodes in my cluster, 256 vnodes per node, and I'm using
>>> the Murmur3 partitioner, should I then expect to see the values of "tokens"
>>> in system.peers and system.local be 768 evenly-distributed values between
>>> -2^63 and 2^63?
>>>
>>> 4. Is there any other way, without using Thrift, to get as much
>>> information as possible about what nodes contain replicas of data for all of
>>> the token ranges in a given cluster?
>>>
>>> I really appreciate any help, thanks!
>>>
>>> Best regards,
>>> Clint
>>
>>
>

Re: Meaning of "token" column in system.peers and system.local

Posted by Theo Hultberg <th...@iconara.net>.
your assumption about 256 tokens per node is correct.

as for your second question, it seems to me like most of your assumptions
are correct, but I'm not sure I understand them correctly. hopefully
someone else can answer this better. tokens are a property of the cluster
and not the keyspace. the first replica of any token will be the same for
all keyspaces, but with different replication factors the other replicas
will differ.

when you query the system.local and system.peers tables you must make sure
that you don't connect to other nodes. I think the inconsistency you think
you found is because the first and second queries went to different nodes.
the java driver will connect to all nodes and load balance requests by
default.

T#


On Mon, Mar 31, 2014 at 4:06 AM, Clint Kelly <cl...@gmail.com> wrote:

> BTW one other thing that I have not been able to debug today that maybe
> someone can help me with:
>
> I am using a three-node Cassandra cluster with Vagrant.  The nodes in my
> cluster are 192.168.200.11, 192.168.200.12, and 192.168.200.13.
>
> If I use cqlsh to connect to 192.168.200.11, I see unique sets of tokens
> when I run the following three commands:
>
> select tokens from system.local;
> select tokens from system.peers where peer = '192.168.200.12';
> select tokens from system.peers where peer = '192.168.200.13';
>
> This is what I expect.  However, when I tried making an application with
> the Java driver that does the following:
>
>
>    - Create a Session by connecting to 192.168.200.11
>    - From that session, "select tokens from system.local"
>    - From that session, "select tokens, peer from system.peers"
>
> Now I get the exact-same set of tokens from system.local and from the row
> in system.peers in which peer=192.168.200.13.
>
> Anyone have any idea why this would happen?  I'm not sure how to debug
> this.  I see the following log from the Java driver:
>
> 14/03/30 19:05:24 DEBUG com.datastax.driver.core.Cluster: Starting new
> cluster with contact points [/192.168.200.11]
> 14/03/30 19:05:24 INFO com.datastax.driver.core.Cluster: New Cassandra
> host /192.168.200.13 added
> 14/03/30 19:05:24 INFO com.datastax.driver.core.Cluster: New Cassandra
> host /192.168.200.12 added
>
> I'm running Cassandra 2.0.6 in the virtual machine and I built my
> application with version 2.0.1 of the driver.
>
> Best regards,
> Clint
>
>
>
>
>
>
>
> On Sun, Mar 30, 2014 at 4:51 PM, Clint Kelly <cl...@gmail.com> wrote:
>
>> Hi all,
>>
>>
>> I am working on a Hadoop InputFormat implementation that uses only the
>> native protocol Java driver and not the Thrift API.  I am currently trying
>> to replicate some of the behavior of
>> *Cassandra.client.describe_ring(myKeyspace)* from the Thrift API.  I
>> would like to do the following:
>>
>>    - Get a list of all of the token ranges for a cluster
>>    - For every token range, determine the replica nodes on which the
>>    data in the token range resides
>>    - Estimate the number of rows for every range of tokens
>>    - Group ranges of tokens on common replica nodes such that we can
>>    create a set of input splits for Hadoop with total estimated line counts
>>    that are reasonably close to the requested split size
>>
>> Last week I received some much-appreciated help on this list that pointed
>> me to using the system.peers table to get the list of token ranges for the
>> cluster and the corresponding hosts.  Today I created a three-node C*
>> cluster in Vagrant (https://github.com/dholbrook/vagrant-cassandra) and
>> tried inspecting some of the system tables.  I have a couple of questions
>> now:
>>
>> 1. *How many total unique tokens should I expect to see in my cluster?*
>> If I have three nodes, and each node has a cassandra.yaml with num_tokens =
>> 256, then should I expect a total of 256*3 = 768 distinct vnodes?
>>
>> 2. *How does the creation of vnodes and their assignment to nodes relate
>> to the replication factor for a given keyspace?*  I never thought about
>> this until today, and I tried to reread the documentation on virtual nodes,
>> replication in Cassandra, etc., and now I am sadly still confused.  Here is
>> what I think I understand.  :)
>>
>>    - Given a row with a partition key, any client request for an
>>    operation on that row will go to a coordinator node in the cluster.
>>    - The coordinator node will compute the token value for the row and
>>    from that determine a set of replica nodes for that token.
>>       - One of the replica nodes I assume is the node that "owns" the
>>       vnode with the token range that encompasses the token
>>       - The identity of the "owner" of this virtual node is a
>>       cross-keyspace property
>>       - And the other replicas were originally chosen based on the
>>       replica-placement strategy
>>       - And therefore the other replicas will be different for each
>>       keyspace (because replication factors and replica-placement strategy are
>>       properties of a keyspace)
>>
>> 3. What do the values in the "token" column in system.peers and
>> system.local refer to then?
>>
>>    - Since these tables appear to be global, and not per-keyspace
>>    properties, I assume that they don't have any information about replication
>>    in them, is that correct?
>>    - If I have three nodes in my cluster, 256 vnodes per node, and I'm
>>    using the Murmur3 partitioner, should I then expect to see the values of
>>    "tokens" in system.peers and system.local be 768 evenly-distributed values
>>    between -2^63 and 2^63?
>>
>> 4. Is there any other way, without using Thrift, to get as much
>> information as possible about what nodes contain replicas of data for all
>> of the token ranges in a given cluster?
>>
>> I really appreciate any help, thanks!
>>
>> Best regards,
>> Clint
>>
>
>

Re: Meaning of "token" column in system.peers and system.local

Posted by Clint Kelly <cl...@gmail.com>.
BTW one other thing that I have not been able to debug today that maybe
someone can help me with:

I am using a three-node Cassandra cluster with Vagrant.  The nodes in my
cluster are 192.168.200.11, 192.168.200.12, and 192.168.200.13.

If I use cqlsh to connect to 192.168.200.11, I see unique sets of tokens
when I run the following three commands:

select tokens from system.local;
select tokens from system.peers where peer = '192.168.200.12';
select tokens from system.peers where peer = '192.168.200.13';

This is what I expect.  However, when I tried making an application with
the Java driver that does the following:


   - Create a Session by connecting to 192.168.200.11
   - From that session, "select tokens from system.local"
   - From that session, "select tokens, peer from system.peers"

Now I get the exact-same set of tokens from system.local and from the row
in system.peers in which peer=192.168.200.13.

Anyone have any idea why this would happen?  I'm not sure how to debug
this.  I see the following log from the Java driver:

14/03/30 19:05:24 DEBUG com.datastax.driver.core.Cluster: Starting new
cluster with contact points [/192.168.200.11]
14/03/30 19:05:24 INFO com.datastax.driver.core.Cluster: New Cassandra host
/192.168.200.13 added
14/03/30 19:05:24 INFO com.datastax.driver.core.Cluster: New Cassandra host
/192.168.200.12 added

I'm running Cassandra 2.0.6 in the virtual machine and I built my
application with version 2.0.1 of the driver.

Best regards,
Clint







On Sun, Mar 30, 2014 at 4:51 PM, Clint Kelly <cl...@gmail.com> wrote:

> Hi all,
>
>
> I am working on a Hadoop InputFormat implementation that uses only the
> native protocol Java driver and not the Thrift API.  I am currently trying
> to replicate some of the behavior of
> *Cassandra.client.describe_ring(myKeyspace)* from the Thrift API.  I
> would like to do the following:
>
>    - Get a list of all of the token ranges for a cluster
>    - For every token range, determine the replica nodes on which the data
>    in the token range resides
>    - Estimate the number of rows for every range of tokens
>    - Group ranges of tokens on common replica nodes such that we can
>    create a set of input splits for Hadoop with total estimated line counts
>    that are reasonably close to the requested split size
>
> Last week I received some much-appreciated help on this list that pointed
> me to using the system.peers table to get the list of token ranges for the
> cluster and the corresponding hosts.  Today I created a three-node C*
> cluster in Vagrant (https://github.com/dholbrook/vagrant-cassandra) and
> tried inspecting some of the system tables.  I have a couple of questions
> now:
>
> 1. *How many total unique tokens should I expect to see in my cluster?*
> If I have three nodes, and each node has a cassandra.yaml with num_tokens =
> 256, then should I expect a total of 256*3 = 768 distinct vnodes?
>
> 2. *How does the creation of vnodes and their assignment to nodes relate
> to the replication factor for a given keyspace?*  I never thought about
> this until today, and I tried to reread the documentation on virtual nodes,
> replication in Cassandra, etc., and now I am sadly still confused.  Here is
> what I think I understand.  :)
>
>    - Given a row with a partition key, any client request for an
>    operation on that row will go to a coordinator node in the cluster.
>    - The coordinator node will compute the token value for the row and
>    from that determine a set of replica nodes for that token.
>       - One of the replica nodes I assume is the node that "owns" the
>       vnode with the token range that encompasses the token
>       - The identity of the "owner" of this virtual node is a
>       cross-keyspace property
>       - And the other replicas were originally chosen based on the
>       replica-placement strategy
>       - And therefore the other replicas will be different for each
>       keyspace (because replication factors and replica-placement strategy are
>       properties of a keyspace)
>
> 3. What do the values in the "token" column in system.peers and
> system.local refer to then?
>
>    - Since these tables appear to be global, and not per-keyspace
>    properties, I assume that they don't have any information about replication
>    in them, is that correct?
>    - If I have three nodes in my cluster, 256 vnodes per node, and I'm
>    using the Murmur3 partitioner, should I then expect to see the values of
>    "tokens" in system.peers and system.local be 768 evenly-distributed values
>    between -2^63 and 2^63?
>
> 4. Is there any other way, without using Thrift, to get as much information
> as possible about what nodes contain replicas of data for all of the token
> ranges in a given cluster?
>
> I really appreciate any help, thanks!
>
> Best regards,
> Clint
>

Re: Meaning of "token" column in system.peers and system.local

Posted by Robert Coli <rc...@eventbrite.com>.
On Sun, Mar 30, 2014 at 4:51 PM, Clint Kelly <cl...@gmail.com> wrote:

> 1. *How many total unique tokens should I expect to see in my cluster?*
> If I have three nodes, and each node has a cassandra.yaml with num_tokens =
> 256, then should I expect a total of 256*3 = 768 distinct vnodes?
>

Yes. Generally, vnodes are just like nodes, except there are more of them:
more than one of them per physical node.


> 2. *How does the creation of vnodes and their assignment to nodes relate
> to the replication factor for a given keyspace?*
>

The same way that it would if you created the same number of nodes with a
rack-unaware ("simple") snitch. If you have racks configured, it does the
rack thing with vnodes... which is less clear than in the CASSANDRA-3810
non-vnodes rack-aware no-op case, but logically the same.

> 3. What do the values in the "token" column in system.peers and
> system.local refer to then?
>
Node primary range ownership. Each node, v or not, has one and exactly one
token. The space between its token and the next token is the "primary
range" it is responsible for.

>
>    - 4. Is there any other way, without using Thrift, to get as much
>    information as possible about what nodes contain replicas of data for all
>    of the token ranges in a given cluster.
>
I don't know the CQL answer to this, but for JMX there is
getNaturalEndpoints.
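
A rough, untested sketch of invoking that over JMX from Java follows. It
assumes the StorageService MBean operation getNaturalEndpoints(keyspace,
columnFamily, key), which is the same operation nodetool getendpoints uses;
the keyspace, table, and key values are placeholders.

import java.net.InetAddress;
import java.util.List;

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class NaturalEndpoints {
    public static void main(String[] args) throws Exception {
        // Cassandra's default JMX port is 7199.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://192.168.200.11:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName storageService =
                    new ObjectName("org.apache.cassandra.db:type=StorageService");

            // Which nodes hold replicas for this partition key?
            // ("mykeyspace", "mytable", "somekey" are placeholders.)
            @SuppressWarnings("unchecked")
            List<InetAddress> endpoints = (List<InetAddress>) mbs.invoke(
                    storageService,
                    "getNaturalEndpoints",
                    new Object[] { "mykeyspace", "mytable", "somekey" },
                    new String[] { "java.lang.String", "java.lang.String",
                                   "java.lang.String" });

            System.out.println("Replicas: " + endpoints);
        } finally {
            connector.close();
        }
    }
}

The same information is available from the command line, one key at a time,
as "nodetool getendpoints <keyspace> <table> <key>".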

=Rob