Posted to user@cassandra.apache.org by Brice Figureau <br...@daysofwonder.com> on 2015/10/17 17:15:18 UTC
Read query taking a long time
Hi,
I've read all I could find on how Cassandra works, but I'm still wondering why the following query takes more than 5s to return on a simple (and modest) 3-node Cassandra 2.1.9 cluster:
SELECT sequence_nr, used
FROM messages
WHERE persistence_id = 'session-SW' AND partition_nr = 0;
The schema of this table:
CREATE TABLE akka.messages (
persistence_id text,
partition_nr bigint,
sequence_nr bigint,
message blob,
used boolean static,
PRIMARY KEY ((persistence_id, partition_nr), sequence_nr)
) WITH CLUSTERING ORDER BY (sequence_nr ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
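For context, the compound PRIMARY KEY above means that every row sharing the same (persistence_id, partition_nr) pair lives in one partition, sorted by sequence_nr, so the query above is a single-partition slice. A toy model of that layout (plain Python, with hypothetical sample rows):

```python
from collections import defaultdict

# Hypothetical sample rows: (persistence_id, partition_nr, sequence_nr, used)
rows = [
    ("session-SW", 0, 3, True),
    ("session-SW", 0, 1, True),
    ("other-id",   0, 1, True),
    ("session-SW", 0, 2, True),
]

# Group by the composite partition key, then sort by the clustering column,
# mirroring how Cassandra lays out rows within a partition on disk.
partitions = defaultdict(list)
for pid, pnr, seq, used in rows:
    partitions[(pid, pnr)].append((seq, used))
for key in partitions:
    partitions[key].sort()

# WHERE persistence_id = 'session-SW' AND partition_nr = 0
# reads exactly one of these partitions, in sequence_nr ASC order:
result = partitions[("session-SW", 0)]
print(result)
```

So the read itself is as cheap as Cassandra reads get; the cost can only come from the size of that one partition or from the cluster around it.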
This is a query from Akka cassandra journal (https://github.com/krasserm/akka-persistence-cassandra) which returns 33 rows for the moment.
The time doesn't vary if I use a QUORUM consistency or ONE.
Note that the same query with a different `persistence_id` is much faster (it might depend on the number of rows returned).
Here's the (redacted) trace taken with cqlsh:
activity | timestamp | source | source_elapsed
---------------------------------------------------------------------------------------------------+----------------------------+----------------+----------------
Execute CQL3 query | 2015-10-17 17:01:05.681000 | 192.168.168.26 | 0
Parsing .................................................................. [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 | 84
READ message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:05.683000 | 192.168.168.29 | 69
Preparing statement [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 | 215
reading data from /192.168.168.29 [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 | 1189
Sending READ message to /192.168.168.29 [MessagingService-Outgoing-/192.168.168.29] | 2015-10-17 17:01:05.683000 | 192.168.168.26 | 1317
Executing single-partition query on messages [SharedPool-Worker-1] | 2015-10-17 17:01:05.684000 | 192.168.168.29 | 175
Acquiring sstable references [SharedPool-Worker-1] | 2015-10-17 17:01:05.684000 | 192.168.168.29 | 189
Merging memtable tombstones [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 | 204
Key cache hit for sstable 434 [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 | 257
Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 | 269
Key cache hit for sstable 432 [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 | 514
Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 | 527
Key cache hit for sstable 433 [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 | 779
Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 | 789
Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 | 929
Merging data from memtables and 3 sstables [SharedPool-Worker-1] | 2015-10-17 17:01:05.687000 | 192.168.168.29 | 956
speculating read retry on /192.168.168.26 [SharedPool-Worker-1] | 2015-10-17 17:01:05.724000 | 192.168.168.26 | 42423
Sending READ message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:05.724000 | 192.168.168.26 | 42573
READ message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:05.724000 | 192.168.168.26 | 42763
Executing single-partition query on messages [SharedPool-Worker-2] | 2015-10-17 17:01:05.724000 | 192.168.168.26 | 42855
Acquiring sstable references [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 | 42868
Merging memtable tombstones [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 | 42934
Bloom filter allows skipping sstable 443 [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 | 42965
Key cache hit for sstable 442 [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 | 42986
Seeking to partition beginning in data file [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 | 42991
Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2015-10-17 17:01:05.726000 | 192.168.168.26 | 43115
Merging data from memtables and 1 sstables [SharedPool-Worker-2] | 2015-10-17 17:01:05.726000 | 192.168.168.26 | 43137
Read 33 live and 0 tombstone cells [SharedPool-Worker-2] | 2015-10-17 17:01:11.239000 | 192.168.168.26 | 557248
Enqueuing response to /192.168.168.26 [SharedPool-Worker-2] | 2015-10-17 17:01:11.244000 | 192.168.168.26 | 562973
Sending REQUEST_RESPONSE message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:11.245000 | 192.168.168.26 | 563209
REQUEST_RESPONSE message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:11.302000 | 192.168.168.26 | 620575
Processing response from /192.168.168.26 [SharedPool-Worker-2] | 2015-10-17 17:01:11.302000 | 192.168.168.26 | 620792
Read 33 live and 0 tombstone cells [SharedPool-Worker-1] | 2015-10-17 17:01:11.330000 | 192.168.168.29 | 646481
Enqueuing response to /192.168.168.26 [SharedPool-Worker-1] | 2015-10-17 17:01:11.335000 | 192.168.168.29 | 651628
Sending REQUEST_RESPONSE message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:11.335000 | 192.168.168.29 | 651769
Request complete | 2015-10-17 17:01:11.312485 | 192.168.168.26 | 631485
I'm not sure I'm reading the trace correctly, but it seems most of the time is spent on:
"Read 33 live and 0 tombstone cells"
What does this step do?
How can I troubleshoot it?
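One way to read the trace is to diff consecutive source_elapsed values per source node; a quick sketch over a few of the rows above:

```python
from itertools import groupby

# (event, source, source_elapsed) copied from a few rows of the trace above.
events = [
    ("Merging data from memtables and 1 sstables", "192.168.168.26", 43137),
    ("Read 33 live and 0 tombstone cells",         "192.168.168.26", 557248),
    ("Enqueuing response to /192.168.168.26",      "192.168.168.26", 562973),
    ("Merging data from memtables and 3 sstables", "192.168.168.29", 956),
    ("Read 33 live and 0 tombstone cells",         "192.168.168.29", 646481),
]

# Per node, the gap between consecutive source_elapsed values is the time
# spent reaching that step.
gaps = []
for node, evs in groupby(events, key=lambda e: e[1]):
    evs = list(evs)
    for prev, cur in zip(evs, evs[1:]):
        gaps.append((cur[2] - prev[2], cur[0], node))

worst = max(gaps)
print(worst)  # the biggest gap on each node leads into the "Read 33 live" step
```

On both replicas, virtually all of the elapsed time accumulates between "Merging data" and "Read 33 live and 0 tombstone cells".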
The cluster configuration is the default one, with the following changes: concurrent_reads and concurrent_writes set to 16 (because it's very modest hardware, but I tried 32 without any difference), increased timeouts, and the hsha RPC server.
Note that I had to increase the timeouts for the request to succeed and be able to trace it (otherwise it was always timing out).
Note that the same request on the same data loaded from a snapshot on a 1-node cassandra cluster running in a VM on my laptop doesn't exhibit the same issue (it's instantaneous).
Also, an unrelated question: in what unit is the "source_elapsed" column?
And is the "timestamp" taken at the start of the operation or at the end?
Thanks for any answers,
--
Brice Figureau
Re: Read query taking a long time
Posted by Carlos Alonso <in...@mrcalonso.com>.
Well...
I think that pretty much shows the problem, and I'd say the problem is a
bad data model. Your read query is perfect: it hits a single partition, and
that's the best situation. But on the other hand, it turns out that there
are one or two huge partitions, and fully reading them is extremely
expensive. The tails of all your cfhistograms are very long, i.e. the
difference between the 99% and the max is very big.
As you can see, 99% of your partitions are within a reasonable 454KB,
but the maximum goes above 10MB. Reviewing your data
model I have two ideas:
1. Either you have very wide rows (i.e. lots of records for the same
partition key), or
2. Some of your blob data is huge. I suspect the problem is here, as
the difference in cells between the 99% and the max is very small while the
data size difference is huge, so partitions are not very wide (i.e. not
many rows) but very heavy (lots of data per cell).
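That argument can be put in numbers with a small arithmetic sketch (the figures are taken from the cfhistograms output quoted below):

```python
# Figures from the cfhistograms output in this thread.
p99_partition_bytes, max_partition_bytes = 454826, 10090808
p99_cells, max_cells = 924, 1109

# A big jump in partition bytes with almost no jump in cell count points
# at a few very heavy cells (large blobs) rather than very wide partitions.
size_ratio = max_partition_bytes / p99_partition_bytes  # bytes: max vs p99
cell_ratio = max_cells / p99_cells                      # cells: max vs p99
print(round(size_ratio, 1), round(cell_ratio, 2))
```

The largest partition is ~22x bigger than the 99th percentile in bytes, but only ~1.2x bigger in cell count.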
Does that make sense?
Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>
On 21 October 2015 at 08:35, Brice Figureau <
brice+cassandra@daysofwonder.com> wrote:
> Hi,
>
> On 20/10/2015 19:48, Carlos Alonso wrote:
> > I think also having the output of cfhistograms could help. I'd like to
> > know how many sstables are being hit during reads.
>
> Oops, my bad, I thought you meant proxyhistograms.
> Here's the worst node cfhistograms:
>
> nodetool cfhistograms akka messages
> akka/messages histograms
> Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
>                            (micros)      (micros)         (bytes)
> 50%             1.00          60.00        310.00             535           1
> 75%             2.00         103.00       9887.00           11864          17
> 95%             3.00       14237.00     126934.00           73457         215
> 98%             3.00       51012.00     219342.00          263210         446
> 99%             3.00       61214.00    1131752.00          454826         924
> Min             0.00           6.00         18.00              73           0
> Max             6.00      263210.00    5839588.00        10090808        1109
>
>
> > Also, which CL are you reading with?
>
> For the test request in cqlsh, I'm using ONE, but the application in which
> I first observed the issue was using QUORUM.
>
> > cfstats is a local command, so maybe that node you've printed is working
> > fine but there's another that is causing the latency. Can you check that
> > command in all nodes?
>
> I checked and on all nodes, the read latency and read local latency are
> within 15 to 40ms.
>
> I also noticed that C* was taking a fair bit of CPU on some of the nodes
> (ranging from 60% to 200%), looking at ttop output it was mostly taken by
> SharedPool-Worker threads, which I assume are the threads doing the
> real query work.
>
> Well, I'm puzzled, and I'll keep searching, thanks for your help!
> --
> Brice Figureau
>
Re: Read query taking a long time
Posted by "Laing, Michael" <mi...@nytimes.com>.
Are the clocks synchronized across the cluster - probably, but I thought I
would ask :)
On Wed, Oct 21, 2015 at 3:35 AM, Brice Figureau <
brice+cassandra@daysofwonder.com> wrote:
> Hi,
>
> On 20/10/2015 19:48, Carlos Alonso wrote:
> > I think also having the output of cfhistograms could help. I'd like to
> > know how many sstables are being hit during reads.
>
> Oops, my bad, I thought you meant proxyhistograms.
> Here's the worst node cfhistograms:
>
> nodetool cfhistograms akka messages
> akka/messages histograms
> Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
>                            (micros)      (micros)         (bytes)
> 50%             1.00          60.00        310.00             535           1
> 75%             2.00         103.00       9887.00           11864          17
> 95%             3.00       14237.00     126934.00           73457         215
> 98%             3.00       51012.00     219342.00          263210         446
> 99%             3.00       61214.00    1131752.00          454826         924
> Min             0.00           6.00         18.00              73           0
> Max             6.00      263210.00    5839588.00        10090808        1109
>
>
> > Also, which CL are you reading with?
>
> For the test request in cqlsh, I'm using ONE, but the application in which
> I first observed the issue was using QUORUM.
>
> > cfstats is a local command, so maybe that node you've printed is working
> > fine but there's another that is causing the latency. Can you check that
> > command in all nodes?
>
> I checked and on all nodes, the read latency and read local latency are
> within 15 to 40ms.
>
> I also noticed that C* was taking a fair bit of CPU on some of the nodes
> (ranging from 60% to 200%), looking at ttop output it was mostly taken by
> SharedPool-Worker threads, which I assume are the threads doing the
> real query work.
>
> Well, I'm puzzled, and I'll keep searching, thanks for your help!
> --
> Brice Figureau
>
Re: Read query taking a long time
Posted by Brice Figureau <br...@daysofwonder.com>.
Hi,
On 20/10/2015 19:48, Carlos Alonso wrote:
> I think also having the output of cfhistograms could help. I'd like to
> know how many sstables are being hit during reads.
Oops, my bad, I thought you meant proxyhistograms.
Here's the worst node cfhistograms:
nodetool cfhistograms akka messages
akka/messages histograms
Percentile SSTables Write Latency Read Latency Partition Size Cell Count
(micros) (micros) (bytes)
50% 1.00 60.00 310.00 535 1
75% 2.00 103.00 9887.00 11864 17
95% 3.00 14237.00 126934.00 73457 215
98% 3.00 51012.00 219342.00 263210 446
99% 3.00 61214.00 1131752.00 454826 924
Min 0.00 6.00 18.00 73 0
Max 6.00 263210.00 5839588.00 10090808 1109
> Also, which CL are you reading with?
For the test request in cqlsh, I'm using ONE, but the application in which I first observed the issue was using QUORUM.
> cfstats is a local command, so maybe that node you've printed is working
> fine but there's another that is causing the latency. Can you check that
> command in all nodes?
I checked and on all nodes, the read latency and read local latency are within 15 to 40ms.
I also noticed that C* was taking a fair bit of CPU on some of the nodes (ranging from 60% to 200%); looking at ttop output, it was mostly taken by SharedPool-Worker threads, which I assume are the threads doing the real query work.
Well, I'm puzzled, and I'll keep searching, thanks for your help!
--
Brice Figureau
Re: Read query taking a long time
Posted by Carlos Alonso <in...@mrcalonso.com>.
I think also having the output of cfhistograms could help. I'd like to know
how many sstables are being hit during reads.
Also, which CL are you reading with?
cfstats is a local command, so maybe that node you've printed is working
fine but there's another that is causing the latency. Can you check that
command in all nodes?
Regards
Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>
On 20 October 2015 at 13:59, Brice Figureau <
brice+cassandra@daysofwonder.com> wrote:
> Hi,
>
> Thanks for your answer. Unfortunately since I wrote my e-mail, things
> are a bit better.
>
> This might be because I moved from openjdk 7 to oracle jdk 8 after
> having seen a warning in the C* log about openjdk, and I also added a
> node (for other reasons).
>
> Now the query itself takes only 1.5s~2s instead of the 5s~6s it was
> taking before.
>
> On Mon, 2015-10-19 at 14:38 +0100, Carlos Alonso wrote:
> > Could you send cfhistograms and cfstats relevant to the read column
> > family?
>
> Here is the requested information:
> % nodetool proxyhistograms
> proxy histograms
> Percentile Read Latency Write Latency Range Latency
> (micros) (micros) (micros)
> 50% 1109.00 372.00 1916.00
> 75% 14237.00 535.00 2759.00
> 95% 105778.00 642.00 4768.00
> 98% 545791.00 770.00 11864.00
> 99% 785939.00 924.00 14237.00
> Min 73.00 0.00 373.00
> Max 5839588.00 88148.00 73457.00
>
> % nodetool cfstats akka.messages
> Keyspace: akka
> Read Count: 3334784
> Read Latency: 9.98472696792356 ms.
> Write Count: 7124
> Write Latency: 0.572256457046603 ms.
> Pending Flushes: 0
> Table: messages
> SSTable count: 1
> Space used (live): 4680841
> Space used (total): 4680841
> Space used by snapshots (total): 23615746
> Off heap memory used (total): 4051
> SSTable Compression Ratio: 0.17318784636027024
> Number of keys (estimate): 478
> Memtable cell count: 317
> Memtable data size: 42293
> Memtable off heap memory used: 0
> Memtable switch count: 10
> Local read count: 3334784
> Local read latency: 9.985 ms
> Local write count: 7124
> Local write latency: 0.573 ms
> Pending flushes: 0
> Bloom filter false positives: 0
> Bloom filter false ratio: 0.00000
> Bloom filter space used: 592
> Bloom filter off heap memory used: 584
> Index summary off heap memory used: 203
> Compression metadata off heap memory used: 3264
> Compacted partition minimum bytes: 73
> Compacted partition maximum bytes: 17436917
> Compacted partition mean bytes: 63810
> Average live cells per slice (last five minutes): 0.6693421039216356
> Maximum live cells per slice (last five minutes): 1033.0
> Average tombstones per slice (last five minutes): 0.0
> Maximum tombstones per slice (last five minutes): 0.0
>
> ----------------
>
> If I read correctly this, there's a huge read latency while proxying,
> but local read latency, or even all node latency on this table is
> correct.
>
> Would that mean this is a network issue?
> --
> Brice Figureau <br...@daysofwonder.com>
>
>
Re: Read query taking a long time
Posted by Brice Figureau <br...@daysofwonder.com>.
Hi,
Thanks for your answer. Unfortunately since I wrote my e-mail, things
are a bit better.
This might be because I moved from openjdk 7 to oracle jdk 8 after
having seen a warning in the C* log about openjdk, and I also added a
node (for other reasons).
Now the query itself takes only 1.5s~2s instead of the 5s~6s it was
taking before.
On Mon, 2015-10-19 at 14:38 +0100, Carlos Alonso wrote:
> Could you send cfhistograms and cfstats relevant to the read column
> family?
Here is the requested information:
% nodetool proxyhistograms
proxy histograms
Percentile Read Latency Write Latency Range Latency
(micros) (micros) (micros)
50% 1109.00 372.00 1916.00
75% 14237.00 535.00 2759.00
95% 105778.00 642.00 4768.00
98% 545791.00 770.00 11864.00
99% 785939.00 924.00 14237.00
Min 73.00 0.00 373.00
Max 5839588.00 88148.00 73457.00
% nodetool cfstats akka.messages
Keyspace: akka
Read Count: 3334784
Read Latency: 9.98472696792356 ms.
Write Count: 7124
Write Latency: 0.572256457046603 ms.
Pending Flushes: 0
Table: messages
SSTable count: 1
Space used (live): 4680841
Space used (total): 4680841
Space used by snapshots (total): 23615746
Off heap memory used (total): 4051
SSTable Compression Ratio: 0.17318784636027024
Number of keys (estimate): 478
Memtable cell count: 317
Memtable data size: 42293
Memtable off heap memory used: 0
Memtable switch count: 10
Local read count: 3334784
Local read latency: 9.985 ms
Local write count: 7124
Local write latency: 0.573 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 592
Bloom filter off heap memory used: 584
Index summary off heap memory used: 203
Compression metadata off heap memory used: 3264
Compacted partition minimum bytes: 73
Compacted partition maximum bytes: 17436917
Compacted partition mean bytes: 63810
Average live cells per slice (last five minutes): 0.6693421039216356
Maximum live cells per slice (last five minutes): 1033.0
Average tombstones per slice (last five minutes): 0.0
Maximum tombstones per slice (last five minutes): 0.0
----------------
If I read this correctly, there's a huge read latency while proxying,
but the local read latency, and indeed every node's latency on this
table, looks correct.
Would that mean this is a network issue?
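A rough comparison of the two figures above makes the gap concrete. It is apples to oranges (coordinator p99 vs local mean, and it assumes the histogram units are microseconds), but it is indicative:

```python
# From the nodetool proxyhistograms / cfstats output above.
proxy_p99_read_us = 785939  # 99th percentile coordinator (proxy) read latency
local_read_ms = 9.985       # mean local read latency from cfstats

# Coordinator-level p99 vs local mean read latency.
ratio = (proxy_p99_read_us / 1000) / local_read_ms
print(round(ratio))
```

The slow tail appears at the coordinator level, roughly 79x the local mean, which is why the network (or the coordinator path) is the suspect.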
--
Brice Figureau <br...@daysofwonder.com>
Re: Read query taking a long time
Posted by Jon Haddad <jo...@gmail.com>.
I wrote a blog post a while back that you may find helpful on diagnosing problems in production. There are a lot of things that could potentially be wrong with your cluster, and going back and forth on the ML to pin down the right one will take forever.
http://rustyrazorblade.com/2014/09/cassandra-summit-recap-diagnosing-problems-in-production/
Once you've figured out what's wrong (and fixed it), you should read Al's tuning guide:
https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
Jon
> On Oct 19, 2015, at 7:38 AM, Carlos Alonso <in...@mrcalonso.com> wrote:
>
> Could you send cfhistograms and cfstats relevant to the read column family?
>
> That could help
>
> Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>
>
> On 17 October 2015 at 16:15, Brice Figureau <brice+cassandra@daysofwonder.com> wrote:
> Hi,
>
> I've read all I could find on how Cassandra works, but I'm still wondering why the following query takes more than 5s to return on a simple (and modest) 3-node Cassandra 2.1.9 cluster:
>
> SELECT sequence_nr, used
> FROM messages
> WHERE persistence_id = 'session-SW' AND partition_nr = 0;
>
> The schema of this table:
>
> CREATE TABLE akka.messages (
> persistence_id text,
> partition_nr bigint,
> sequence_nr bigint,
> message blob,
> used boolean static,
> PRIMARY KEY ((persistence_id, partition_nr), sequence_nr)
> ) WITH CLUSTERING ORDER BY (sequence_nr ASC)
> AND bloom_filter_fp_chance = 0.01
> AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
> AND comment = ''
> AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
> AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
> AND dclocal_read_repair_chance = 0.1
> AND default_time_to_live = 0
> AND gc_grace_seconds = 864000
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair_chance = 0.0
> AND speculative_retry = '99.0PERCENTILE';
>
>
> This is a query from Akka cassandra journal (https://github.com/krasserm/akka-persistence-cassandra) which returns 33 rows for the moment.
>
> The time doesn't vary if I use a QUORUM consistency or ONE.
> Note that the same query with a different `persistence_id` is much faster (it might depend on the number of rows returned).
>
> Here's the (redacted) trace taken with cqlsh:
>
> activity | timestamp | source | source_elapsed
> ---------------------------------------------------------------------------------------------------+----------------------------+----------------+----------------
> Execute CQL3 query | 2015-10-17 17:01:05.681000 | 192.168.168.26 | 0
> Parsing .................................................................. [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 | 84
> READ message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:05.683000 | 192.168.168.29 | 69
> Preparing statement [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 | 215
> reading data from /192.168.168.29 [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 | 1189
> Sending READ message to /192.168.168.29 [MessagingService-Outgoing-/192.168.168.29] | 2015-10-17 17:01:05.683000 | 192.168.168.26 | 1317
> Executing single-partition query on messages [SharedPool-Worker-1] | 2015-10-17 17:01:05.684000 | 192.168.168.29 | 175
> Acquiring sstable references [SharedPool-Worker-1] | 2015-10-17 17:01:05.684000 | 192.168.168.29 | 189
> Merging memtable tombstones [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 | 204
> Key cache hit for sstable 434 [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 | 257
> Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 | 269
> Key cache hit for sstable 432 [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 | 514
> Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 | 527
> Key cache hit for sstable 433 [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 | 779
> Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 | 789
> Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 | 929
> Merging data from memtables and 3 sstables [SharedPool-Worker-1] | 2015-10-17 17:01:05.687000 | 192.168.168.29 | 956
> speculating read retry on /192.168.168.26 [SharedPool-Worker-1] | 2015-10-17 17:01:05.724000 | 192.168.168.26 | 42423
> Sending READ message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:05.724000 | 192.168.168.26 | 42573
> READ message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:05.724000 | 192.168.168.26 | 42763
> Executing single-partition query on messages [SharedPool-Worker-2] | 2015-10-17 17:01:05.724000 | 192.168.168.26 | 42855
> Acquiring sstable references [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 | 42868
> Merging memtable tombstones [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 | 42934
> Bloom filter allows skipping sstable 443 [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 | 42965
> Key cache hit for sstable 442 [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 | 42986
> Seeking to partition beginning in data file [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 | 42991
> Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2015-10-17 17:01:05.726000 | 192.168.168.26 | 43115
> Merging data from memtables and 1 sstables [SharedPool-Worker-2] | 2015-10-17 17:01:05.726000 | 192.168.168.26 | 43137
> Read 33 live and 0 tombstone cells [SharedPool-Worker-2] | 2015-10-17 17:01:11.239000 | 192.168.168.26 | 557248
> Enqueuing response to /192.168.168.26 [SharedPool-Worker-2] | 2015-10-17 17:01:11.244000 | 192.168.168.26 | 562973
> Sending REQUEST_RESPONSE message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:11.245000 | 192.168.168.26 | 563209
> REQUEST_RESPONSE message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:11.302000 | 192.168.168.26 | 620575
> Processing response from /192.168.168.26 [SharedPool-Worker-2] | 2015-10-17 17:01:11.302000 | 192.168.168.26 | 620792
> Read 33 live and 0 tombstone cells [SharedPool-Worker-1] | 2015-10-17 17:01:11.330000 | 192.168.168.29 | 646481
> Enqueuing response to /192.168.168.26 [SharedPool-Worker-1] | 2015-10-17 17:01:11.335000 | 192.168.168.29 | 651628
> Sending REQUEST_RESPONSE message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:11.335000 | 192.168.168.29 | 651769
> Request complete | 2015-10-17 17:01:11.312485 | 192.168.168.26 | 631485
>
>
> I'm not sure I'm reading the trace correctly, but it seems most of the time is spent on:
> "Read 33 live and 0 tombstone cells"
>
> What does this step do?
> How can I troubleshoot it?
>
> The cluster configuration is the default one, with the following changes: concurrent_reads and writes at 16 (because it's a very modest hardware, but I used 32 without any differences), increased timeouts, and using hsha.
>
> Note that I had to increase the timeouts for the request to succeed and be able to trace it (otherwise it was always timing out).
>
> Note that the same request on the same data loaded from a snapshot on a 1-node cassandra cluster running in a VM on my laptop doesn't exhibit the same issue (it's instantaneous).
>
> Also, a non-related question: in what unit is the column "source_elapsed"?
> And is the "timestamp" taken at the start of the operation or at the end?
>
> Thanks for any answers,
> --
> Brice Figureau
>
Re: Read query taking a long time
Posted by Carlos Alonso <in...@mrcalonso.com>.
Could you send cfhistograms and cfstats relevant to the read column family?
That could help
Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>
On 17 October 2015 at 16:15, Brice Figureau <
brice+cassandra@daysofwonder.com> wrote:
> Hi,
>
> I've read all I could find on how Cassandra works, but I'm still wondering why
> the following query takes more than 5s to return on a simple (and modest)
> 3-node Cassandra 2.1.9 cluster:
>
> SELECT sequence_nr, used
> FROM messages
> WHERE persistence_id = 'session-SW' AND partition_nr = 0;
>
> The schema of this table:
>
> CREATE TABLE akka.messages (
> persistence_id text,
> partition_nr bigint,
> sequence_nr bigint,
> message blob,
> used boolean static,
> PRIMARY KEY ((persistence_id, partition_nr), sequence_nr)
> ) WITH CLUSTERING ORDER BY (sequence_nr ASC)
> AND bloom_filter_fp_chance = 0.01
> AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
> AND comment = ''
> AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
> AND compression = {'sstable_compression':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
> AND dclocal_read_repair_chance = 0.1
> AND default_time_to_live = 0
> AND gc_grace_seconds = 864000
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair_chance = 0.0
> AND speculative_retry = '99.0PERCENTILE';
>
>
> This is a query from Akka cassandra journal (
> https://github.com/krasserm/akka-persistence-cassandra) which returns 33
> rows for the moment.
>
> The time doesn't vary if I use a QUORUM consistency or ONE.
> Note that the same query with a different `persistence_id` is much
> faster (it might depend on the number of rows returned).
>
> Here's the (redacted) trace taken with cqlsh:
>
> activity | timestamp | source | source_elapsed
> ---------------------------------------------------------------------------------------------------+----------------------------+----------------+----------------
> Execute CQL3 query | 2015-10-17 17:01:05.681000 | 192.168.168.26 | 0
> Parsing .................................................................. [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 | 84
> READ message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:05.683000 | 192.168.168.29 | 69
> Preparing statement [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 | 215
> reading data from /192.168.168.29 [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 | 1189
> Sending READ message to /192.168.168.29 [MessagingService-Outgoing-/192.168.168.29] | 2015-10-17 17:01:05.683000 | 192.168.168.26 | 1317
> Executing single-partition query on messages [SharedPool-Worker-1] | 2015-10-17 17:01:05.684000 | 192.168.168.29 | 175
> Acquiring sstable references [SharedPool-Worker-1] | 2015-10-17 17:01:05.684000 | 192.168.168.29 | 189
> Merging memtable tombstones [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 | 204
> Key cache hit for sstable 434 [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 | 257
> Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 | 269
> Key cache hit for sstable 432 [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 | 514
> Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 | 527
> Key cache hit for sstable 433 [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 | 779
> Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 | 789
> Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 | 929
> Merging data from memtables and 3 sstables [SharedPool-Worker-1] | 2015-10-17 17:01:05.687000 | 192.168.168.29 | 956
> speculating read retry on /192.168.168.26 [SharedPool-Worker-1] | 2015-10-17 17:01:05.724000 | 192.168.168.26 | 42423
> Sending READ message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:05.724000 | 192.168.168.26 | 42573
> READ message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:05.724000 | 192.168.168.26 | 42763
> Executing single-partition query on messages [SharedPool-Worker-2] | 2015-10-17 17:01:05.724000 | 192.168.168.26 | 42855
> Acquiring sstable references [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 | 42868
> Merging memtable tombstones [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 | 42934
> Bloom filter allows skipping sstable 443 [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 | 42965
> Key cache hit for sstable 442 [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 | 42986
> Seeking to partition beginning in data file [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 | 42991
> Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2015-10-17 17:01:05.726000 | 192.168.168.26 | 43115
> Merging data from memtables and 1 sstables [SharedPool-Worker-2] | 2015-10-17 17:01:05.726000 | 192.168.168.26 | 43137
> Read 33 live and 0 tombstone cells [SharedPool-Worker-2] | 2015-10-17 17:01:11.239000 | 192.168.168.26 | 557248
> Enqueuing response to /192.168.168.26 [SharedPool-Worker-2] | 2015-10-17 17:01:11.244000 | 192.168.168.26 | 562973
> Sending REQUEST_RESPONSE message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:11.245000 | 192.168.168.26 | 563209
> REQUEST_RESPONSE message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:11.302000 | 192.168.168.26 | 620575
> Processing response from /192.168.168.26 [SharedPool-Worker-2] | 2015-10-17 17:01:11.302000 | 192.168.168.26 | 620792
> Read 33 live and 0 tombstone cells [SharedPool-Worker-1] | 2015-10-17 17:01:11.330000 | 192.168.168.29 | 646481
> Enqueuing response to /192.168.168.26 [SharedPool-Worker-1] | 2015-10-17 17:01:11.335000 | 192.168.168.29 | 651628
> Sending REQUEST_RESPONSE message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:11.335000 | 192.168.168.29 | 651769
> Request complete | 2015-10-17 17:01:11.312485 | 192.168.168.26 | 631485
>
>
> I'm not sure I'm reading the trace correctly, but it seems most of the time is
> spent on:
> "Read 33 live and 0 tombstone cells"
>
> What does this step do?
> How can I troubleshoot it?
>
> The cluster configuration is the default one, with the following changes:
> concurrent_reads and writes at 16 (because it's a very modest hardware, but
> I used 32 without any differences), increased timeouts, and using hsha.
>
> Note that I had to increase the timeouts for the request to succeed and be
> able to trace it (otherwise it was always timing out).
>
> Note that the same request on the same data loaded from a snapshot on a
> 1-node cassandra cluster running in a VM on my laptop doesn't exhibit the
> same issue (it's instantaneous).
>
> Also, a non-related question: in what unit is the column "source_elapsed"?
> And is the "timestamp" taken at the start of the operation or at the end?
>
> Thanks for any answers,
> --
> Brice Figureau
>