Posted to user@cassandra.apache.org by Tom van den Berge <to...@gmail.com> on 2015/09/08 17:22:32 UTC

Trace evidence for LOCAL_QUORUM ending up in remote DC

I've been bugging you a few times, but now I've got trace data for a query
with LOCAL_QUORUM that is being sent to a remote data center.

The setup is as follows:
NetworkTopologyStrategy: {"DC1":"1","DC2":"2"}
Both DC1 and DC2 have 2 nodes.
In DC2, one node is currently being rebuilt, and therefore does not contain
all data (yet).
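
For reference, the keyspace definition amounts to something like this (the
keyspace name is a placeholder, not the real one):

    CREATE KEYSPACE app_ks   -- placeholder keyspace name
      WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '1', 'DC2': '2'};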

The client app connects to a node in DC1, and sends a SELECT query with CL
LOCAL_QUORUM, which in this case means (1/2)+1 = 1 replica in DC1.
If all is ok, the query always produces a result, because the requested
rows are guaranteed to be available in DC1.

However, the query sometimes produces no result. I've been able to record
the traces of these queries, and it turns out that the coordinator node in
DC1 sometimes sends the query to DC2, to the node that is being rebuilt,
and does not have the requested rows. I've included an example trace below.

The coordinator node is 10.55.156.67, which is in DC1. The 10.88.4.194 node
is in DC2.
I've verified that the CL is LOCAL_QUORUM by printing it when the query is
sent (I'm using the DataStax Java driver).
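
The same read can also be traced from cqlsh, roughly like this (the literal
values are placeholders; the real keys don't matter here):

    -- set the consistency level, enable tracing, then run the query
    CONSISTENCY LOCAL_QUORUM;
    TRACING ON;
    SELECT * FROM Aggregate WHERE type = 'sometype' AND typeId = 'some-id';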

 activity                                                                   | source       | source_elapsed | thread
----------------------------------------------------------------------------+--------------+----------------+-----------------------------------------
 Message received from /10.55.156.67                                        | 10.88.4.194  |             48 | MessagingService-Incoming-/10.55.156.67
 Executing single-partition query on aggregate                              | 10.88.4.194  |            286 | SharedPool-Worker-2
 Acquiring sstable references                                               | 10.88.4.194  |            306 | SharedPool-Worker-2
 Merging memtable tombstones                                                | 10.88.4.194  |            321 | SharedPool-Worker-2
 Partition index lookup allows skipping sstable 107                         | 10.88.4.194  |            458 | SharedPool-Worker-2
 Bloom filter allows skipping sstable 1                                     | 10.88.4.194  |            489 | SharedPool-Worker-2
 Skipped 0/2 non-slice-intersecting sstables, included 0 due to tombstones  | 10.88.4.194  |            496 | SharedPool-Worker-2
 Merging data from memtables and 0 sstables                                 | 10.88.4.194  |            500 | SharedPool-Worker-2
 Read 0 live and 0 tombstone cells                                          | 10.88.4.194  |            513 | SharedPool-Worker-2
 Enqueuing response to /10.55.156.67                                        | 10.88.4.194  |            613 | SharedPool-Worker-2
 Sending message to /10.55.156.67                                           | 10.88.4.194  |            672 | MessagingService-Outgoing-/10.55.156.67
 Parsing SELECT * FROM Aggregate WHERE type=? AND typeId=?;                 | 10.55.156.67 |             10 | SharedPool-Worker-4
 Sending message to /10.88.4.194                                            | 10.55.156.67 |           4335 | MessagingService-Outgoing-/10.88.4.194
 Message received from /10.88.4.194                                         | 10.55.156.67 |           6328 | MessagingService-Incoming-/10.88.4.194
 Seeking to partition beginning in data file                                | 10.55.156.67 |          10417 | SharedPool-Worker-3
 Key cache hit for sstable 389                                              | 10.55.156.67 |          10586 | SharedPool-Worker-3

My question is: how is it possible that the query is sent to a node in DC2?
Since DC1 has 2 nodes and RF 1, the query should always be sent to the
other node in DC1 if the coordinator does not have a replica, right?

Thanks,
Tom

Re: Trace evidence for LOCAL_QUORUM ending up in remote DC

Posted by Bryan Cheng <br...@blockcypher.com>.
Tom, I don't believe so; it seems the symptom would be an indefinite (or
very long) hang.

To clarify, is this issue restricted to LOCAL_QUORUM? Can you issue a
LOCAL_ONE SELECT and retrieve the expected data back?
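
For example, from cqlsh (with placeholder values for the two key columns):

    CONSISTENCY LOCAL_ONE;
    SELECT * FROM Aggregate WHERE type = 'sometype' AND typeId = 'some-id';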

On Tue, Sep 8, 2015 at 12:02 PM, Tom van den Berge <
tom.vandenberge@gmail.com> wrote:

> Just to be sure: can this bug result in a 0-row result while it should be
> > 0 ?
> On 8 Sep 2015, 6:29 PM, "Tyler Hobbs" <ty...@datastax.com> wrote:
>
> See https://issues.apache.org/jira/browse/CASSANDRA-9753
>

Re: Trace evidence for LOCAL_QUORUM ending up in remote DC

Posted by Nate McCall <na...@thelastpickle.com>.
Thanks for reporting back, Tom.

Can you drop a comment on the ticket with a sentence or two describing your
specific case and that speculative_retry = NONE was a valid work-around?

That will make it easier for the next folks that come along to have a
concrete problem/solution in a single comment on that ticket.
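
For whoever lands on this thread later, the work-around itself is a single
schema change, something like this (placeholder keyspace name; the table
name is the one from the trace earlier in the thread):

    ALTER TABLE app_ks.Aggregate WITH speculative_retry = 'NONE';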

Glad to hear it worked, though.

On Tue, Sep 8, 2015 at 3:38 PM, Tom van den Berge <tom.vandenberge@gmail.com
> wrote:

> Nate,
>
> I've disabled it, and it's been running for about an hour now without
> problems, while before, the problem occurred roughly every few minutes. I
> guess it's safe to say that this proves that CASSANDRA-9753
> <https://issues.apache.org/jira/browse/CASSANDRA-9753> is the cause of
> the problem.
>
> I'm very happy to finally know the cause of this problem! Thanks for
> pointing me in the right direction.
> Tom
>
>


-- 
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: Trace evidence for LOCAL_QUORUM ending up in remote DC

Posted by Tom van den Berge <to...@gmail.com>.
Nate,

I've disabled it, and it's been running for about an hour now without
problems, while before, the problem occurred roughly every few minutes. I
guess it's safe to say that this proves that CASSANDRA-9753
<https://issues.apache.org/jira/browse/CASSANDRA-9753> is the cause of the
problem.

I'm very happy to finally know the cause of this problem! Thanks for
pointing me in the right direction.
Tom

On Tue, Sep 8, 2015 at 9:13 PM, Nate McCall <na...@thelastpickle.com> wrote:

> Just to be sure: can this bug result in a 0-row result while it should be
>> > 0 ?
>>
> Per Tyler's reference to CASSANDRA-9753
> <https://issues.apache.org/jira/browse/CASSANDRA-9753>, you would see
> this if the read was routed by speculative retry to the nodes that were not
> yet finished being built.
>
> Does this work as anticipated when you set speculative_retry to NONE?
>
>
>
>
> --
> -----------------
> Nate McCall
> Austin, TX
> @zznate
>
> Co-Founder & Sr. Technical Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>

Re: Trace evidence for LOCAL_QUORUM ending up in remote DC

Posted by Nate McCall <na...@thelastpickle.com>.
>
> Just to be sure: can this bug result in a 0-row result while it should be
> > 0 ?
>
Per Tyler's reference to CASSANDRA-9753
<https://issues.apache.org/jira/browse/CASSANDRA-9753>, you would see this
if the read was routed by speculative retry to the nodes that were not yet
finished being built.

Does this work as anticipated when you set speculative_retry to NONE?
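
To check what a table is currently using, the setting shows up among the
table options in cqlsh; the default is the 99th percentile, so speculative
reads are enabled unless it is explicitly set to NONE. For example
(placeholder keyspace name):

    -- speculative_retry appears among the table options in the output
    DESCRIBE TABLE app_ks.Aggregate;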




-- 
-----------------
Nate McCall
Austin, TX
@zznate

Co-Founder & Sr. Technical Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: Trace evidence for LOCAL_QUORUM ending up in remote DC

Posted by Tom van den Berge <to...@gmail.com>.
Just to be sure: can this bug result in a 0-row result while it should be >
0 ?
On 8 Sep 2015, 6:29 PM, "Tyler Hobbs" <ty...@datastax.com> wrote:

> See https://issues.apache.org/jira/browse/CASSANDRA-9753
>

Re: Trace evidence for LOCAL_QUORUM ending up in remote DC

Posted by Tyler Hobbs <ty...@datastax.com>.
See https://issues.apache.org/jira/browse/CASSANDRA-9753



-- 
Tyler Hobbs
DataStax <http://datastax.com/>