Posted to user@cassandra.apache.org by Brice Figureau <br...@daysofwonder.com> on 2015/10/17 17:15:18 UTC

Read query taking a long time

Hi,

I've read all I could find on how Cassandra works, but I'm still wondering why the following query takes more than 5 seconds to return on a simple (and modest) 3-node Cassandra 2.1.9 cluster:

SELECT sequence_nr, used 
FROM messages 
WHERE persistence_id = 'session-SW' AND partition_nr = 0;

The schema of this table:

CREATE TABLE akka.messages (
    persistence_id text,
    partition_nr bigint,
    sequence_nr bigint,
    message blob,
    used boolean static,
    PRIMARY KEY ((persistence_id, partition_nr), sequence_nr)
) WITH CLUSTERING ORDER BY (sequence_nr ASC)
    AND bloom_filter_fp_chance = 0.01
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';


This is a query from the Akka Cassandra journal (https://github.com/krasserm/akka-persistence-cassandra), which currently returns 33 rows.

The time doesn't vary whether I use QUORUM or ONE consistency.
Note that the same query with a different `persistence_id` is much faster (it might depend on the number of rows returned).
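
For reference, a trace like the one below can be captured in cqlsh roughly like this (just a sketch; the exact session may have differed slightly):

USE akka;
CONSISTENCY ONE;   -- or QUORUM, the timing is the same either way
TRACING ON;        -- print the request trace after each statement
SELECT sequence_nr, used
FROM messages
WHERE persistence_id = 'session-SW' AND partition_nr = 0;
TRACING OFF;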

Here's the (redacted) trace taken with cqlsh:

activity                                                                                          | timestamp                  | source         | source_elapsed
---------------------------------------------------------------------------------------------------+----------------------------+----------------+----------------
                                                                                Execute CQL3 query | 2015-10-17 17:01:05.681000 | 192.168.168.26 |              0
  Parsing .................................................................. [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 |             84
            READ message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:05.683000 | 192.168.168.29 |             69
                                                         Preparing statement [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 |            215
                                           reading data from /192.168.168.29 [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 |           1189
               Sending READ message to /192.168.168.29 [MessagingService-Outgoing-/192.168.168.29] | 2015-10-17 17:01:05.683000 | 192.168.168.26 |           1317
                                Executing single-partition query on messages [SharedPool-Worker-1] | 2015-10-17 17:01:05.684000 | 192.168.168.29 |            175
                                                Acquiring sstable references [SharedPool-Worker-1] | 2015-10-17 17:01:05.684000 | 192.168.168.29 |            189
                                                 Merging memtable tombstones [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 |            204
                                               Key cache hit for sstable 434 [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 |            257
                                 Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 |            269
                                               Key cache hit for sstable 432 [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 |            514
                                 Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 |            527
                                               Key cache hit for sstable 433 [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 |            779
                                 Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 |            789
   Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 |            929
                                  Merging data from memtables and 3 sstables [SharedPool-Worker-1] | 2015-10-17 17:01:05.687000 | 192.168.168.29 |            956
                                   speculating read retry on /192.168.168.26 [SharedPool-Worker-1] | 2015-10-17 17:01:05.724000 | 192.168.168.26 |          42423
               Sending READ message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:05.724000 | 192.168.168.26 |          42573
            READ message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:05.724000 | 192.168.168.26 |          42763
                                Executing single-partition query on messages [SharedPool-Worker-2] | 2015-10-17 17:01:05.724000 | 192.168.168.26 |          42855
                                                Acquiring sstable references [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 |          42868
                                                 Merging memtable tombstones [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 |          42934
                                    Bloom filter allows skipping sstable 443 [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 |          42965
                                               Key cache hit for sstable 442 [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 |          42986
                                 Seeking to partition beginning in data file [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 |          42991
   Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2015-10-17 17:01:05.726000 | 192.168.168.26 |          43115
                                  Merging data from memtables and 1 sstables [SharedPool-Worker-2] | 2015-10-17 17:01:05.726000 | 192.168.168.26 |          43137
                                          Read 33 live and 0 tombstone cells [SharedPool-Worker-2] | 2015-10-17 17:01:11.239000 | 192.168.168.26 |         557248
                                       Enqueuing response to /192.168.168.26 [SharedPool-Worker-2] | 2015-10-17 17:01:11.244000 | 192.168.168.26 |         562973
   Sending REQUEST_RESPONSE message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:11.245000 | 192.168.168.26 |         563209
REQUEST_RESPONSE message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:11.302000 | 192.168.168.26 |         620575
                                    Processing response from /192.168.168.26 [SharedPool-Worker-2] | 2015-10-17 17:01:11.302000 | 192.168.168.26 |         620792
                                          Read 33 live and 0 tombstone cells [SharedPool-Worker-1] | 2015-10-17 17:01:11.330000 | 192.168.168.29 |         646481
                                       Enqueuing response to /192.168.168.26 [SharedPool-Worker-1] | 2015-10-17 17:01:11.335000 | 192.168.168.29 |         651628
   Sending REQUEST_RESPONSE message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:11.335000 | 192.168.168.29 |         651769
                                                                                  Request complete | 2015-10-17 17:01:11.312485 | 192.168.168.26 |         631485


I'm not sure I'm reading the trace correctly, but it seems most of the time is spent in this step:
"Read 33 live and 0 tombstone cells"

What does this step do?
How can I troubleshoot it?

The cluster configuration is the default one, with the following changes: concurrent_reads and concurrent_writes set to 16 (because the hardware is very modest; I also tried 32 without any difference), increased timeouts, and the hsha RPC server.
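
In cassandra.yaml terms, the non-default settings look roughly like this (a sketch; the timeout values shown are only illustrative, not the exact ones I set):

concurrent_reads: 16
concurrent_writes: 16
rpc_server_type: hsha
# timeouts raised well above the 2.1 defaults, e.g.:
read_request_timeout_in_ms: 20000
request_timeout_in_ms: 20000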

Note that I had to increase the timeouts for the request to succeed so that I could trace it (otherwise it always timed out).

Note that the same request, on the same data loaded from a snapshot into a single-node Cassandra cluster running in a VM on my laptop, doesn't exhibit the same issue (it's instantaneous).
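
(For reference, restoring a snapshot onto a single test node like that boils down to copying the snapshot's SSTables into the table's data directory and running something like the command below, or streaming them with sstableloader; the path here is only illustrative:)

nodetool refresh akka messages
# or: sstableloader -d 127.0.0.1 /path/to/snapshot/akka/messages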

Also, an unrelated question: what unit is the "source_elapsed" column in?
And is the "timestamp" taken at the start of the operation or at the end?

Thanks for any answers,
-- 
Brice Figureau

Re: Read query taking a long time

Posted by Carlos Alonso <in...@mrcalonso.com>.
Well...

I think that pretty much shows the problem, and I'd say the problem is a
bad data model. Your read query is perfect: it hits a single partition,
which is the best situation. On the other hand, it turns out that there
are one or two huge partitions, and fully reading them is extremely
expensive. The tails of all your cfhistograms are very long, i.e. the
difference between the 99th percentile and the max is very big.

As you can see, 99% of your partitions are within a reasonable 454 KB,
but the maximum goes above 10 MB. Reviewing your data model, I have two
ideas:

1. You have very wide rows (i.e. lots of records for the same partition
key).
2. Some of your blob data is huge. I suspect the problem is here: the
difference in cell count between the 99th percentile and the max is
quite small, while the difference in data size is huge, so the
partitions are not very wide (not many rows) but very heavy (a lot of
data per row).
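
A quick way to tell the two cases apart is to check how many rows the
slow partition actually holds, e.g. (sketch):

SELECT COUNT(*)
FROM akka.messages
WHERE persistence_id = 'session-SW' AND partition_nr = 0;

If the count is small (around 33 in your case) while the partition is
still several MB on disk, the weight is in the blobs rather than in the
number of rows.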

Does that make sense?

Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>

On 21 October 2015 at 08:35, Brice Figureau <
brice+cassandra@daysofwonder.com> wrote:

> Hi,
>
> On 20/10/2015 19:48, Carlos Alonso wrote:
> > I think also having the output of cfhistograms could help. I'd like to
> > know how many sstables are being hit during reads.
>
> Oops, my bad, I thought you meant proxyhistograms.
> Here's the worst node cfhistograms:
>
> nodetool cfhistograms akka messages
> akka/messages histograms
> Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
>                               (micros)          (micros)           (bytes)
> 50%             1.00             60.00            310.00               535                 1
> 75%             2.00            103.00           9887.00             11864                17
> 95%             3.00          14237.00         126934.00             73457               215
> 98%             3.00          51012.00         219342.00            263210               446
> 99%             3.00          61214.00        1131752.00            454826               924
> Min             0.00              6.00             18.00                73                 0
> Max             6.00         263210.00        5839588.00          10090808              1109
>
>
> > Also, which CL are you reading with?
>
> For the test request in cqlsh, I'm using ONE, but the application in which
> I first observed the issue was using QUORUM.
>
> > cfstats is a local command, so maybe that node you've printed is working
> > fine but there's another that is causing the latency. Can you check that
> > command in all nodes?
>
> I checked and on all nodes, the read latency and read local latency are
> within 15 to 40ms.
>
> I also noticed that C* was taking a fair bit of CPU on some of the nodes
> (ranging from 60% to 200%); looking at ttop output, it was mostly consumed
> by SharedPool-Worker threads, which I assume are the threads doing the
> real query work.
>
> Well, I'm puzzled, and I'll keep searching, thanks for your help!
> --
> Brice Figureau
>

Re: Read query taking a long time

Posted by "Laing, Michael" <mi...@nytimes.com>.
Are the clocks synchronized across the cluster? Probably, but I thought I
would ask :)
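
For example, if the nodes run ntpd, something like this on each of them
would tell:

ntpq -p   # the peer marked with '*' is the sync source; the offset column (ms) should be small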

On Wed, Oct 21, 2015 at 3:35 AM, Brice Figureau <
brice+cassandra@daysofwonder.com> wrote:

> Hi,
>
> On 20/10/2015 19:48, Carlos Alonso wrote:
> > I think also having the output of cfhistograms could help. I'd like to
> > know how many sstables are being hit during reads.
>
> Oops, my bad, I thought you meant proxyhistograms.
> Here's the worst node cfhistograms:
>
> nodetool cfhistograms akka messages
> akka/messages histograms
> Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
>                               (micros)          (micros)           (bytes)
> 50%             1.00             60.00            310.00               535                 1
> 75%             2.00            103.00           9887.00             11864                17
> 95%             3.00          14237.00         126934.00             73457               215
> 98%             3.00          51012.00         219342.00            263210               446
> 99%             3.00          61214.00        1131752.00            454826               924
> Min             0.00              6.00             18.00                73                 0
> Max             6.00         263210.00        5839588.00          10090808              1109
>
>
> > Also, which CL are you reading with?
>
> For the test request in cqlsh, I'm using ONE, but the application in which
> I first observed the issue was using QUORUM.
>
> > cfstats is a local command, so maybe that node you've printed is working
> > fine but there's another that is causing the latency. Can you check that
> > command in all nodes?
>
> I checked and on all nodes, the read latency and read local latency are
> within 15 to 40ms.
>
> I also noticed that C* was taking a fair bit of CPU on some of the nodes
> (ranging from 60% to 200%); looking at ttop output, it was mostly consumed
> by SharedPool-Worker threads, which I assume are the threads doing the
> real query work.
>
> Well, I'm puzzled, and I'll keep searching, thanks for your help!
> --
> Brice Figureau
>

Re: Read query taking a long time

Posted by Brice Figureau <br...@daysofwonder.com>.
Hi,

On 20/10/2015 19:48, Carlos Alonso wrote:
> I think also having the output of cfhistograms could help. I'd like to
> know how many sstables are being hit during reads.

Oops, my bad, I thought you meant proxyhistograms.
Here's the worst node cfhistograms:

nodetool cfhistograms akka messages
akka/messages histograms
Percentile  SSTables     Write Latency      Read Latency    Partition Size        Cell Count
                              (micros)          (micros)           (bytes)
50%             1.00             60.00            310.00               535                 1
75%             2.00            103.00           9887.00             11864                17
95%             3.00          14237.00         126934.00             73457               215
98%             3.00          51012.00         219342.00            263210               446
99%             3.00          61214.00        1131752.00            454826               924
Min             0.00              6.00             18.00                73                 0
Max             6.00         263210.00        5839588.00          10090808              1109


> Also, which CL are you reading with?

For the test request in cqlsh, I'm using ONE, but the application in which I first observed the issue was using QUORUM.

> cfstats is a local command, so maybe that node you've printed is working
> fine but there's another that is causing the latency. Can you check that
> command in all nodes?

I checked, and on all nodes the read latency and local read latency are within 15 to 40ms.

I also noticed that C* was taking a fair bit of CPU on some of the nodes (ranging from 60% to 200%); looking at ttop output, it was mostly consumed by SharedPool-Worker threads, which I assume are the threads doing the real query work.
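
For what it's worth, thread pool activity can also be watched with nodetool, e.g.:

nodetool tpstats   # active/pending/completed/blocked tasks per stage, including ReadStage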

Well, I'm puzzled, and I'll keep searching, thanks for your help!
-- 
Brice Figureau

Re: Read query taking a long time

Posted by Carlos Alonso <in...@mrcalonso.com>.
I think also having the output of cfhistograms could help. I'd like to know
how many sstables are being hit during reads.

Also, which CL are you reading with?

cfstats is a local command, so maybe that node you've printed is working
fine but there's another that is causing the latency. Can you check that
command in all nodes?
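
If JMX is reachable, you can run it against each node from a single box,
something like this (using the two addresses visible in your trace; add
the third node's address as well):

nodetool -h 192.168.168.26 cfstats akka.messages
nodetool -h 192.168.168.29 cfstats akka.messages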

Regards

Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>

On 20 October 2015 at 13:59, Brice Figureau <
brice+cassandra@daysofwonder.com> wrote:

> Hi,
>
> Thanks for your answer. Unfortunately since I wrote my e-mail, things
> are a bit better.
>
> This might be because I moved from openjdk 7 to oracle jdk 8 after
> having seen a warning in the C* log about openjdk, and I also added a
> node (for other reasons).
>
> Now the query itself takes only 1.5s~2s instead of the 5s~6s it was
> taking before.
>
> On Mon, 2015-10-19 at 14:38 +0100, Carlos Alonso wrote:
> > Could you send cfhistograms and cfstats relevant to the read column
> > family?
>
> Here is the requested information:
> % nodetool proxyhistograms
> proxy histograms
> Percentile      Read Latency     Write Latency     Range Latency
>                     (micros)          (micros)          (micros)
> 50%                  1109.00            372.00           1916.00
> 75%                 14237.00            535.00           2759.00
> 95%                105778.00            642.00           4768.00
> 98%                545791.00            770.00          11864.00
> 99%                785939.00            924.00          14237.00
> Min                    73.00              0.00            373.00
> Max               5839588.00          88148.00          73457.00
>
> % nodetool cfstats akka.messages
> Keyspace: akka
>         Read Count: 3334784
>         Read Latency: 9.98472696792356 ms.
>         Write Count: 7124
>         Write Latency: 0.572256457046603 ms.
>         Pending Flushes: 0
>                 Table: messages
>                 SSTable count: 1
>                 Space used (live): 4680841
>                 Space used (total): 4680841
>                 Space used by snapshots (total): 23615746
>                 Off heap memory used (total): 4051
>                 SSTable Compression Ratio: 0.17318784636027024
>                 Number of keys (estimate): 478
>                 Memtable cell count: 317
>                 Memtable data size: 42293
>                 Memtable off heap memory used: 0
>                 Memtable switch count: 10
>                 Local read count: 3334784
>                 Local read latency: 9.985 ms
>                 Local write count: 7124
>                 Local write latency: 0.573 ms
>                 Pending flushes: 0
>                 Bloom filter false positives: 0
>                 Bloom filter false ratio: 0.00000
>                 Bloom filter space used: 592
>                 Bloom filter off heap memory used: 584
>                 Index summary off heap memory used: 203
>                 Compression metadata off heap memory used: 3264
>                 Compacted partition minimum bytes: 73
>                 Compacted partition maximum bytes: 17436917
>                 Compacted partition mean bytes: 63810
>                 Average live cells per slice (last five minutes): 0.6693421039216356
>                 Maximum live cells per slice (last five minutes): 1033.0
>                 Average tombstones per slice (last five minutes): 0.0
>                 Maximum tombstones per slice (last five minutes): 0.0
>
> ----------------
>
> If I read this correctly, there's a huge read latency at the coordinator
> (proxy) level, but the local read latency on this table looks fine on
> every node.
>
> Would that mean this is a network issue?
> --
> Brice Figureau <br...@daysofwonder.com>
>
>

Re: Read query taking a long time

Posted by Brice Figureau <br...@daysofwonder.com>.
Hi,

Thanks for your answer. Unfortunately, since I wrote my e-mail, things
have gotten a bit better.

This might be because I moved from OpenJDK 7 to Oracle JDK 8 after
seeing a warning about OpenJDK in the C* log, and because I also added
a node (for other reasons).

Now the query itself takes only 1.5s~2s instead of the 5s~6s it was
taking before.

On Mon, 2015-10-19 at 14:38 +0100, Carlos Alonso wrote:
> Could you send cfhistograms and cfstats relevant to the read column
> family?

Here is the requested information:
% nodetool proxyhistograms
proxy histograms
Percentile      Read Latency     Write Latency     Range Latency
                    (micros)          (micros)          (micros)
50%                  1109.00            372.00           1916.00
75%                 14237.00            535.00           2759.00
95%                105778.00            642.00           4768.00
98%                545791.00            770.00          11864.00
99%                785939.00            924.00          14237.00
Min                    73.00              0.00            373.00
Max               5839588.00          88148.00          73457.00

% nodetool cfstats akka.messages
Keyspace: akka
        Read Count: 3334784
        Read Latency: 9.98472696792356 ms.
        Write Count: 7124
        Write Latency: 0.572256457046603 ms.
        Pending Flushes: 0
                Table: messages
                SSTable count: 1
                Space used (live): 4680841
                Space used (total): 4680841
                Space used by snapshots (total): 23615746
                Off heap memory used (total): 4051
                SSTable Compression Ratio: 0.17318784636027024
                Number of keys (estimate): 478
                Memtable cell count: 317
                Memtable data size: 42293
                Memtable off heap memory used: 0
                Memtable switch count: 10
                Local read count: 3334784
                Local read latency: 9.985 ms
                Local write count: 7124
                Local write latency: 0.573 ms
                Pending flushes: 0
                Bloom filter false positives: 0
                Bloom filter false ratio: 0.00000
                Bloom filter space used: 592
                Bloom filter off heap memory used: 584
                Index summary off heap memory used: 203
                Compression metadata off heap memory used: 3264
                Compacted partition minimum bytes: 73
                Compacted partition maximum bytes: 17436917
                Compacted partition mean bytes: 63810
                Average live cells per slice (last five minutes): 0.6693421039216356
                Maximum live cells per slice (last five minutes): 1033.0
                Average tombstones per slice (last five minutes): 0.0
                Maximum tombstones per slice (last five minutes): 0.0

----------------

If I read this correctly, there's a huge read latency at the coordinator
(proxy) level, but the local read latency on this table looks fine on
every node.

Would that mean this is a network issue?
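
If the network is the suspect, a basic round-trip check between the
coordinator and the replica should rule raw latency in or out, e.g.
from 192.168.168.26:

ping -c 20 192.168.168.29   # steady sub-millisecond RTTs on a LAN would rule out raw network latency
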
-- 
Brice Figureau <br...@daysofwonder.com>


Re: Read query taking a long time

Posted by Jon Haddad <jo...@gmail.com>.
I wrote a blog post a while back that you may find helpful for diagnosing problems in production. There are a lot of things that could potentially be wrong with your cluster, and going back and forth on the ML to pin down the right one will take forever.

http://rustyrazorblade.com/2014/09/cassandra-summit-recap-diagnosing-problems-in-production/

Once you've figured out what's wrong (and fixed it), you should read Al's tuning guide:

https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

Jon


> On Oct 19, 2015, at 7:38 AM, Carlos Alonso <in...@mrcalonso.com> wrote:
> 
> Could you send cfhistograms and cfstats relevant to the read column family?
> 
> That could help
> 
> Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>
> 
> On 17 October 2015 at 16:15, Brice Figureau <brice+cassandra@daysofwonder.com> wrote:
> Hi,
> 
> I've read all I could find on how Cassandra works, but I'm still wondering why the following query takes more than 5 seconds to return on a simple (and modest) 3-node Cassandra 2.1.9 cluster:
> 
> SELECT sequence_nr, used
> FROM messages
> WHERE persistence_id = 'session-SW' AND partition_nr = 0;
> 
> The schema of this table:
> 
> CREATE TABLE akka.messages (
>     persistence_id text,
>     partition_nr bigint,
>     sequence_nr bigint,
>     message blob,
>     used boolean static,
>     PRIMARY KEY ((persistence_id, partition_nr), sequence_nr)
> ) WITH CLUSTERING ORDER BY (sequence_nr ASC)
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = ''
>     AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99.0PERCENTILE';
> 
> 
> This is a query from the Akka Cassandra journal (https://github.com/krasserm/akka-persistence-cassandra), which currently returns 33 rows.
> 
> The time doesn't vary whether I use QUORUM or ONE consistency.
> Note that the same query with a different `persistence_id` is much faster (it might depend on the number of rows returned).
> 
> Here's the (redacted) trace taken with cqlsh:
> 
> activity                                                                                          | timestamp                  | source         | source_elapsed
> ---------------------------------------------------------------------------------------------------+----------------------------+----------------+----------------
>                                                                                 Execute CQL3 query | 2015-10-17 17:01:05.681000 | 192.168.168.26 |              0
>   Parsing .................................................................. [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 |             84
>             READ message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:05.683000 | 192.168.168.29 |             69
>                                                          Preparing statement [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 |            215
>                                            reading data from /192.168.168.29 [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 |           1189
>                Sending READ message to /192.168.168.29 [MessagingService-Outgoing-/192.168.168.29] | 2015-10-17 17:01:05.683000 | 192.168.168.26 |           1317
>                                 Executing single-partition query on messages [SharedPool-Worker-1] | 2015-10-17 17:01:05.684000 | 192.168.168.29 |            175
>                                                 Acquiring sstable references [SharedPool-Worker-1] | 2015-10-17 17:01:05.684000 | 192.168.168.29 |            189
>                                                  Merging memtable tombstones [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 |            204
>                                                Key cache hit for sstable 434 [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 |            257
>                                  Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 |            269
>                                                Key cache hit for sstable 432 [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 |            514
>                                  Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 |            527
>                                                Key cache hit for sstable 433 [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 |            779
>                                  Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 |            789
>    Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 |            929
>                                   Merging data from memtables and 3 sstables [SharedPool-Worker-1] | 2015-10-17 17:01:05.687000 | 192.168.168.29 |            956
>                                    speculating read retry on /192.168.168.26 [SharedPool-Worker-1] | 2015-10-17 17:01:05.724000 | 192.168.168.26 |          42423
>                Sending READ message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:05.724000 | 192.168.168.26 |          42573
>             READ message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:05.724000 | 192.168.168.26 |          42763
>                                 Executing single-partition query on messages [SharedPool-Worker-2] | 2015-10-17 17:01:05.724000 | 192.168.168.26 |          42855
>                                                 Acquiring sstable references [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 |          42868
>                                                  Merging memtable tombstones [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 |          42934
>                                     Bloom filter allows skipping sstable 443 [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 |          42965
>                                                Key cache hit for sstable 442 [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 |          42986
>                                  Seeking to partition beginning in data file [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 |          42991
>    Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2015-10-17 17:01:05.726000 | 192.168.168.26 |          43115
>                                   Merging data from memtables and 1 sstables [SharedPool-Worker-2] | 2015-10-17 17:01:05.726000 | 192.168.168.26 |          43137
>                                           Read 33 live and 0 tombstone cells [SharedPool-Worker-2] | 2015-10-17 17:01:11.239000 | 192.168.168.26 |         557248
>                                        Enqueuing response to /192.168.168.26 [SharedPool-Worker-2] | 2015-10-17 17:01:11.244000 | 192.168.168.26 |         562973
>    Sending REQUEST_RESPONSE message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:11.245000 | 192.168.168.26 |         563209
> REQUEST_RESPONSE message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:11.302000 | 192.168.168.26 |         620575
>                                     Processing response from /192.168.168.26 [SharedPool-Worker-2] | 2015-10-17 17:01:11.302000 | 192.168.168.26 |         620792
>                                           Read 33 live and 0 tombstone cells [SharedPool-Worker-1] | 2015-10-17 17:01:11.330000 | 192.168.168.29 |         646481
>                                        Enqueuing response to /192.168.168.26 [SharedPool-Worker-1] | 2015-10-17 17:01:11.335000 | 192.168.168.29 |         651628
>    Sending REQUEST_RESPONSE message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:11.335000 | 192.168.168.29 |         651769
>                                                                                   Request complete | 2015-10-17 17:01:11.312485 | 192.168.168.26 |         631485
> 
> 
> I'm not sure I'm reading the trace correctly, but it seems most of the time is spent in this step:
> "Read 33 live and 0 tombstone cells"
> 
> What does this step do?
> How can I troubleshoot it?
> 
> The cluster configuration is the default one, with the following changes: concurrent_reads and concurrent_writes set to 16 (because the hardware is very modest; I also tried 32 without any difference), increased timeouts, and the hsha RPC server.
> 
> Note that I had to increase the timeouts for the request to succeed and be able to trace it (otherwise it was always timing out).
> 
> Note that the same request on the same data loaded from a snapshot on a 1-node cassandra cluster running in a VM on my laptop doesn't exhibit the same issue (it's instantaneous).
> 
> Also, a non-related question: in what unit is the column "source_elapsed"?
> And is the "timestamp" taken at the start of the operation or at the end?
> 
> Thanks for any answers,
> --
> Brice Figureau
> 


Re: Read query taking a long time

Posted by Carlos Alonso <in...@mrcalonso.com>.
Could you send cfhistograms and cfstats relevant to the read column family?

That could help
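
On 2.1 that would be something like:

nodetool cfhistograms akka messages
nodetool cfstats akka.messages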

Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>

On 17 October 2015 at 16:15, Brice Figureau <
brice+cassandra@daysofwonder.com> wrote:

> Hi,
>
> I've read all I could find on how Cassandra works, but I'm still wondering
> why the following query takes more than 5 seconds to return on a simple
> (and modest) 3-node Cassandra 2.1.9 cluster:
>
> SELECT sequence_nr, used
> FROM messages
> WHERE persistence_id = 'session-SW' AND partition_nr = 0;
>
> The schema of this table:
> 
> CREATE TABLE akka.messages (
>     persistence_id text,
>     partition_nr bigint,
>     sequence_nr bigint,
>     message blob,
>     used boolean static,
>     PRIMARY KEY ((persistence_id, partition_nr), sequence_nr)
> ) WITH CLUSTERING ORDER BY (sequence_nr ASC)
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
>     AND comment = ''
>     AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy'}
>     AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND dclocal_read_repair_chance = 0.1
>     AND default_time_to_live = 0
>     AND gc_grace_seconds = 864000
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99.0PERCENTILE';
>
>
> This is a query from the Akka Cassandra journal
> (https://github.com/krasserm/akka-persistence-cassandra), which currently
> returns 33 rows.
>
> The time doesn't vary whether I use QUORUM or ONE consistency.
> Note that the same query with a different `persistence_id` is much faster
> (it might depend on the number of rows returned).
>
> Here's the (redacted) trace taken with cqlsh:
>
> activity                                                                                          | timestamp                  | source         | source_elapsed
> ---------------------------------------------------------------------------------------------------+----------------------------+----------------+----------------
>                                                                                 Execute CQL3 query | 2015-10-17 17:01:05.681000 | 192.168.168.26 |              0
>   Parsing .................................................................. [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 |             84
>             READ message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:05.683000 | 192.168.168.29 |             69
>                                                          Preparing statement [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 |            215
>                                            reading data from /192.168.168.29 [SharedPool-Worker-1] | 2015-10-17 17:01:05.683000 | 192.168.168.26 |           1189
>                Sending READ message to /192.168.168.29 [MessagingService-Outgoing-/192.168.168.29] | 2015-10-17 17:01:05.683000 | 192.168.168.26 |           1317
>                                 Executing single-partition query on messages [SharedPool-Worker-1] | 2015-10-17 17:01:05.684000 | 192.168.168.29 |            175
>                                                 Acquiring sstable references [SharedPool-Worker-1] | 2015-10-17 17:01:05.684000 | 192.168.168.29 |            189
>                                                  Merging memtable tombstones [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 |            204
>                                                Key cache hit for sstable 434 [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 |            257
>                                  Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 |            269
>                                                Key cache hit for sstable 432 [SharedPool-Worker-1] | 2015-10-17 17:01:05.685000 | 192.168.168.29 |            514
>                                  Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 |            527
>                                                Key cache hit for sstable 433 [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 |            779
>                                  Seeking to partition beginning in data file [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 |            789
>    Skipped 0/3 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-1] | 2015-10-17 17:01:05.686000 | 192.168.168.29 |            929
>                                   Merging data from memtables and 3 sstables [SharedPool-Worker-1] | 2015-10-17 17:01:05.687000 | 192.168.168.29 |            956
>                                    speculating read retry on /192.168.168.26 [SharedPool-Worker-1] | 2015-10-17 17:01:05.724000 | 192.168.168.26 |          42423
>                Sending READ message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:05.724000 | 192.168.168.26 |          42573
>             READ message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:05.724000 | 192.168.168.26 |          42763
>                                 Executing single-partition query on messages [SharedPool-Worker-2] | 2015-10-17 17:01:05.724000 | 192.168.168.26 |          42855
>                                                 Acquiring sstable references [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 |          42868
>                                                  Merging memtable tombstones [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 |          42934
>                                     Bloom filter allows skipping sstable 443 [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 |          42965
>                                                Key cache hit for sstable 442 [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 |          42986
>                                  Seeking to partition beginning in data file [SharedPool-Worker-2] | 2015-10-17 17:01:05.725000 | 192.168.168.26 |          42991
>    Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones [SharedPool-Worker-2] | 2015-10-17 17:01:05.726000 | 192.168.168.26 |          43115
>                                   Merging data from memtables and 1 sstables [SharedPool-Worker-2] | 2015-10-17 17:01:05.726000 | 192.168.168.26 |          43137
>                                           Read 33 live and 0 tombstone cells [SharedPool-Worker-2] | 2015-10-17 17:01:11.239000 | 192.168.168.26 |         557248
>                                        Enqueuing response to /192.168.168.26 [SharedPool-Worker-2] | 2015-10-17 17:01:11.244000 | 192.168.168.26 |         562973
>    Sending REQUEST_RESPONSE message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:11.245000 | 192.168.168.26 |         563209
> REQUEST_RESPONSE message received from /192.168.168.26 [MessagingService-Incoming-/192.168.168.26] | 2015-10-17 17:01:11.302000 | 192.168.168.26 |         620575
>                                     Processing response from /192.168.168.26 [SharedPool-Worker-2] | 2015-10-17 17:01:11.302000 | 192.168.168.26 |         620792
>                                           Read 33 live and 0 tombstone cells [SharedPool-Worker-1] | 2015-10-17 17:01:11.330000 | 192.168.168.29 |         646481
>                                        Enqueuing response to /192.168.168.26 [SharedPool-Worker-1] | 2015-10-17 17:01:11.335000 | 192.168.168.29 |         651628
>    Sending REQUEST_RESPONSE message to /192.168.168.26 [MessagingService-Outgoing-/192.168.168.26] | 2015-10-17 17:01:11.335000 | 192.168.168.29 |         651769
>                                                                                   Request complete | 2015-10-17 17:01:11.312485 | 192.168.168.26 |         631485
>
>
> I'm not sure I'm reading the trace correctly, but it seems most of the
> time is spent in this step:
> "Read 33 live and 0 tombstone cells"
>
> What does this step do?
> How can I troubleshoot it?
>
> The cluster configuration is the default one, with the following changes:
> concurrent_reads and concurrent_writes set to 16 (because the hardware is
> very modest; I also tried 32 without any difference), increased timeouts,
> and the hsha RPC server.
>
> Note that I had to increase the timeouts for the request to succeed and be
> able to trace it (otherwise it was always timing out).
>
> Note that the same request on the same data loaded from a snapshot on a
> 1-node cassandra cluster running in a VM on my laptop doesn't exhibit the
> same issue (it's instantaneous).
>
> Also, a non-related question: in what unit is the column "source_elapsed"?
> And is the "timestamp" taken at the start of the operation or at the end?
>
> Thanks for any answers,
> --
> Brice Figureau
>