Posted to commits@cassandra.apache.org by "Jeff Jirsa (JIRA)" <ji...@apache.org> on 2017/04/01 23:37:41 UTC

[jira] [Comment Edited] (CASSANDRA-13196) test failure in snitch_test.TestGossipingPropertyFileSnitch.test_prefer_local_reconnect_on_listen_address

    [ https://issues.apache.org/jira/browse/CASSANDRA-13196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952462#comment-15952462 ] 

Jeff Jirsa edited comment on CASSANDRA-13196 at 4/1/17 11:36 PM:
-----------------------------------------------------------------

Wouldn't be surprised if there was a race condition there. A not-dissimilar race was solved recently in CASSANDRA-12653, where the race was in setting up token metadata as the node came out of shadow round, and this looks fairly similar: we come out of shadow round at {{2017-02-06 22:13:17,494}}, we submit the migration tasks at {{2017-02-06 22:13:20,622}} and immediately ({{2017-02-06 22:13:20,623}}) decide not to send them, and the FD finally sees the nodes come up at {{2017-02-06 22:13:20,665}}. At the very least, I'm not sure why we'd even try to submit the migration task knowing the instance was down, and re-queueing the schema pull immediately on failure here definitely wouldn't have helped (we'd have failed to send it again, as the instance was still down). I'm sort of wondering if this test still fails with CASSANDRA-12653 committed - it doesn't seem to be the exact same issue, but maybe the changes from 12653 also help with this race?

I'm not sure of the history here, but it seems like [MigrationManager#shouldPullSchemaFrom|https://github.com/Gerrrr/cassandra/blob/463f3fecd9348ea0a4ce6eeeb30141527b8b10eb/src/java/org/apache/cassandra/schema/MigrationManager.java#L125] could potentially check that endpoint's UP/DOWN state in addition to its messaging version.
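
Purely as a sketch (untested, and the existing checks below are from memory rather than copied from the linked branch), the shape of that change might be something like the following - the {{FailureDetector.instance.isAlive(endpoint)}} line is the proposed addition, everything else is roughly what's already there:

{code}
// Untested sketch only - existing checks reproduced from memory, not from the linked branch.
// Would need: import org.apache.cassandra.gms.FailureDetector;
private static boolean shouldPullSchemaFrom(InetAddress endpoint)
{
    // Don't request schema from nodes with a different or unknown major version
    // (may have incompatible schema), don't request schema from fat clients,
    // and don't bother queueing a migration task for an endpoint the failure
    // detector still considers down - we'd only decide not to send it anyway.
    return MessagingService.instance().knowsVersion(endpoint)
           && MessagingService.instance().getRawVersion(endpoint) == MessagingService.current_version
           && FailureDetector.instance.isAlive(endpoint)          // proposed addition
           && !Gossiper.instance.isGossipOnlyMember(endpoint);
}
{code}

That alone obviously wouldn't address whether the pull gets retried once the endpoint actually comes up, but it would at least avoid queueing tasks we already know we won't send.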



> test failure in snitch_test.TestGossipingPropertyFileSnitch.test_prefer_local_reconnect_on_listen_address
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-13196
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13196
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Michael Shuler
>            Assignee: Aleksandr Sorokoumov
>              Labels: dtest, test-failure
>         Attachments: node1_debug.log, node1_gc.log, node1.log, node2_debug.log, node2_gc.log, node2.log
>
>
> example failure:
> http://cassci.datastax.com/job/trunk_dtest/1487/testReport/snitch_test/TestGossipingPropertyFileSnitch/test_prefer_local_reconnect_on_listen_address
> {code}
> {novnode}
> Error Message
> Error from server: code=2200 [Invalid query] message="keyspace keyspace1 does not exist"
> -------------------- >> begin captured logging << --------------------
> dtest: DEBUG: cluster ccm directory: /tmp/dtest-k6b0iF
> dtest: DEBUG: Done setting configuration options:
> {   'initial_token': None,
>     'num_tokens': '32',
>     'phi_convict_threshold': 5,
>     'range_request_timeout_in_ms': 10000,
>     'read_request_timeout_in_ms': 10000,
>     'request_timeout_in_ms': 10000,
>     'truncate_request_timeout_in_ms': 10000,
>     'write_request_timeout_in_ms': 10000}
> cassandra.policies: INFO: Using datacenter 'dc1' for DCAwareRoundRobinPolicy (via host '127.0.0.1'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes
> cassandra.cluster: INFO: New Cassandra host <Host: 127.0.0.1 dc1> discovered
> --------------------- >> end captured logging << ---------------------
> Stacktrace
>   File "/usr/lib/python2.7/unittest/case.py", line 329, in run
>     testMethod()
>   File "/home/automaton/cassandra-dtest/snitch_test.py", line 87, in test_prefer_local_reconnect_on_listen_address
>     new_rows = list(session.execute("SELECT * FROM {}".format(stress_table)))
>   File "/home/automaton/src/cassandra-driver/cassandra/cluster.py", line 1998, in execute
>     return self.execute_async(query, parameters, trace, custom_payload, timeout, execution_profile, paging_state).result()
>   File "/home/automaton/src/cassandra-driver/cassandra/cluster.py", line 3784, in result
>     raise self._final_exception
> 'Error from server: code=2200 [Invalid query] message="keyspace keyspace1 does not exist"\n-------------------- >> begin captured logging << --------------------\ndtest: DEBUG: cluster ccm directory: /tmp/dtest-k6b0iF\ndtest: DEBUG: Done setting configuration options:\n{   \'initial_token\': None,\n    \'num_tokens\': \'32\',\n    \'phi_convict_threshold\': 5,\n    \'range_request_timeout_in_ms\': 10000,\n    \'read_request_timeout_in_ms\': 10000,\n    \'request_timeout_in_ms\': 10000,\n    \'truncate_request_timeout_in_ms\': 10000,\n    \'write_request_timeout_in_ms\': 10000}\ncassandra.policies: INFO: Using datacenter \'dc1\' for DCAwareRoundRobinPolicy (via host \'127.0.0.1\'); if incorrect, please specify a local_dc to the constructor, or limit contact points to local cluster nodes\ncassandra.cluster: INFO: New Cassandra host <Host: 127.0.0.1 dc1> discovered\n--------------------- >> end captured logging << ---------------------'
> {novnode}
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)