Posted to user@cassandra.apache.org by kaveh minooie <ka...@plutoz.com> on 2017/09/11 22:39:30 UTC

load distribution that I can't explain

Hi everyone

So I have a 2-node (node1, node2) Cassandra 3.11 cluster, on which I 
have a keyspace with a replication factor of 2. This keyspace has only 
this table:

CREATE KEYSPACE myks WITH replication = {'class': 'SimpleStrategy', 
'replication_factor': '2'}  AND durable_writes = true;

CREATE TABLE myks.table1 (
     id1 int,
     id2 int,
     id3 int,
     att1 int,
     PRIMARY KEY ((id1, id2, id3), att1)
) WITH CLUSTERING ORDER BY (att1 ASC)
     AND bloom_filter_fp_chance = 0.01
     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
     AND comment = ''
     AND compaction = {'class': 
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 
'max_threshold': '32', 'min_threshold': '4'}
     AND compression = {'chunk_length_in_kb': '64', 'class': 
'org.apache.cassandra.io.compress.LZ4Compressor'}
     AND crc_check_chance = 1.0
     AND dclocal_read_repair_chance = 0.1
     AND default_time_to_live = 0
     AND gc_grace_seconds = 864000
     AND max_index_interval = 2048
     AND memtable_flush_period_in_ms = 0
     AND min_index_interval = 128
     AND read_repair_chance = 0.0
     AND speculative_retry = '99PERCENTILE';


I run two tasks against this table:

Task one first reads:

"SELECT DISTINCT id1, id2, id3 FROM table1 WHERE id1 = :id1-value ALLOW 
FILTERING;";

and then, for each result, reads:
"SELECT COUNT( att1 ) FROM table1 WHERE id1 = :id1-value AND id2 = 
:id2-value AND id3 = :id3-value ;";

and, once done, adds new data by executing:

"INSERT INTO table1 ( id1, id2, id3, att1 ) VALUES ( :id1-value, 
:id2-value, :id3-value, :att1-value ) USING TTL <ttl>;"

as long as there is data for different id1 values. All of these run at 
CL ONE, or ANY for the insert.

Task two only does the select part and doesn't add any new data, again 
for a hundred different id1 values in each run. These are Java 
applications and use com.datastax.driver.
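
In case it helps, the code is roughly like this (a simplified sketch, 
not the actual application code; the class name, the fixed att1, and 
the ttl parameter are just for illustration, with reads at ONE and the 
insert at ANY as described above):

import com.datastax.driver.core.*;

// Simplified sketch of task one: list the distinct keys for one id1,
// count each partition, then add a new row with a TTL.
class TaskOneSketch {
    static void run(Session session, int id1, int att1, int ttl) {
        PreparedStatement distinct = session.prepare(
            "SELECT DISTINCT id1, id2, id3 FROM myks.table1"
            + " WHERE id1 = ? ALLOW FILTERING")
            .setConsistencyLevel(ConsistencyLevel.ONE);
        PreparedStatement count = session.prepare(
            "SELECT COUNT( att1 ) FROM myks.table1"
            + " WHERE id1 = ? AND id2 = ? AND id3 = ?")
            .setConsistencyLevel(ConsistencyLevel.ONE);
        PreparedStatement insert = session.prepare(
            "INSERT INTO myks.table1 ( id1, id2, id3, att1 )"
            + " VALUES ( ?, ?, ?, ? ) USING TTL ?")
            .setConsistencyLevel(ConsistencyLevel.ANY);

        for (Row r : session.execute(distinct.bind(id1))) {
            long n = session.execute(count.bind(
                r.getInt("id1"), r.getInt("id2"), r.getInt("id3")))
                .one().getLong(0);
            // the real task derives the new att1 from n; a fixed value
            // keeps the sketch short
            session.execute(insert.bind(
                id1, r.getInt("id2"), r.getInt("id3"), att1, ttl));
        }
    }
}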

my problem is that when I am running these tasks, especially task one, I 
always see a lot more CPU load on node2 than on node1: on average a ratio 
of 10 to 1, and sometimes as high as 30 to 1. Both of these nodes have 
the same spec. I don't know how to explain this or what configuration 
parameter I need to look into in order to explain it, and I couldn't 
find anything online either. Any hint or suggestion would be really 
appreciated.

thanks,

-- 
Kaveh Minooie



Re: load distribution that I can't explain

Posted by kaveh minooie <ka...@plutoz.com>.
I am using RoundRobin

cluster = Cluster.builder()
                     // ... socket stuff, pool option stuff ...
                     .withLoadBalancingPolicy( new RoundRobinPolicy() )
                     .addContactPoints( hosts )
                     .build();



On 09/13/2017 03:02 AM, kurt greaves wrote:
> Are you using a load balancing policy? That sounds like you are only 
> using node2 as a coordinator.

-- 
Kaveh Minooie



Re: load distribution that I can't explain

Posted by kurt greaves <ku...@instaclustr.com>.
Are you using a load balancing policy? That sounds like you are only using
node2 as a coordinator.

Re: load distribution that I can't explain

Posted by kaveh minooie <ka...@plutoz.com>.
Hi kurt, thanks for responding.

I understand that that query is very resource-consuming. My question is 
why I only see its effect on the same node: considering that I have a 
replication factor of 2, I was expecting this load to be evenly 
distributed between those 2 nodes. That query runs hundreds of times on 
each run, but the load always seems to be on node2. That is what I am 
trying to figure out.


On 09/11/2017 06:25 PM, kurt greaves wrote:
> Your first query will effectively have to perform table scans to satisfy 
> what you are asking. If a query requires ALLOW FILTERING to be 
> specified, it means that Cassandra can't really optimise that query in 
> any way, and it's going to have to read a lot of data (all of it...) to 
> satisfy the result.
> Because you've only specified one component of the partition key, 
> Cassandra doesn't know where to look for that data, and will need to 
> scan all of it to find the partitions matching that restriction.
> 
> If you want to select distinct keys you should probably do it in a 
> distributed manner using token range scans; however, this is generally 
> not a good use case for Cassandra. If you really need to know your 
> partition keys you should probably store them in a separate cache.

-- 
Kaveh Minooie



Re: load distribution that I can't explain

Posted by kurt greaves <ku...@instaclustr.com>.
Your first query will effectively have to perform table scans to satisfy
what you are asking. If a query requires ALLOW FILTERING to be specified,
it means that Cassandra can't really optimise that query in any way, and
it's going to have to read a lot of data (all of it...) to satisfy the
result.
Because you've only specified one component of the partition key,
Cassandra doesn't know where to look for that data, and will need to scan
all of it to find the partitions matching that restriction.

If you want to select distinct keys you should probably do it in a
distributed manner using token range scans; however, this is generally
not a good use case for Cassandra. If you really need to know your
partition keys you should probably store them in a separate cache.
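
A rough sketch of that with the 3.x Java driver (untested, and the class
name is made up; each token range is queried on its own, so no single
coordinator has to drive a whole-table scan):

import com.datastax.driver.core.*;
import java.util.HashSet;
import java.util.Set;

public class DistinctPartitionKeys {
    // Collect every distinct (id1, id2, id3) by scanning the ring one
    // token range at a time.
    static Set<String> scan(Cluster cluster, Session session) {
        PreparedStatement ps = session.prepare(
            "SELECT DISTINCT id1, id2, id3 FROM myks.table1"
            + " WHERE token(id1, id2, id3) > ?"
            + " AND token(id1, id2, id3) <= ?");
        Set<String> keys = new HashSet<>();
        for (TokenRange range : cluster.getMetadata().getTokenRanges()) {
            for (TokenRange sub : range.unwrap()) { // split wrap-around ranges
                BoundStatement bs = ps.bind()
                    .setToken(0, sub.getStart())
                    .setToken(1, sub.getEnd());
                for (Row row : session.execute(bs)) {
                    keys.add(row.getInt("id1") + ":"
                        + row.getInt("id2") + ":" + row.getInt("id3"));
                }
            }
        }
        return keys;
    }
}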