You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@geode.apache.org by "Donal Evans (Jira)" <ji...@apache.org> on 2021/02/03 19:30:00 UTC

[jira] [Comment Edited] (GEODE-8890) ClassCastException in LocatorDiscovery.exchangeRemoteLocators

    [ https://issues.apache.org/jira/browse/GEODE-8890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278313#comment-17278313 ] 

Donal Evans edited comment on GEODE-8890 at 2/3/21, 7:29 PM:
-------------------------------------------------------------

In the logs associated with the original failure, we see that an earlier call to {{TcpClient.requestToServer()}} in {{LocatorDiscovery.exchangeRemoteLocators()}} with a {{RemoteLocatorJoinRequest}} succeeds, as the following output is logged: 
{code:java}
[vm2] [info 2020/12/25 05:33:18.209 GMT locator-2 <WAN Locator Discovery Thread1> tid=0x52] Locator discovery task exchanged locator information 7a2096ae9648[44531] with localhost[45273]: {1=[7a2096ae9648[45273]], 2=[7a2096ae9648[44531]]}.{code}
 

Next, the code sleeps for {{WAN_LOCATOR_PING_INTERVAL}} ms (default value is 10 seconds), then attempts to ping the remote locator with a call to {{TcpClient.requestToServer()}} with a {{RemoteLocatorPingRequest}}. It is this second call that results in the {{ClassCastException}}, as can be seen by the time stamp on the fatal log ({{2020/12/25 05:33:28.218 GMT}}), which is almost exactly 10 seconds after the first log.

What is strange is that the test which logged the exception started after the first logged message from the WAN Locator Discovery Thread: 
{code:java}
[vm3] [info 2020/12/25 05:33:26.063 GMT <Geode Failure Detection Server thread 1> tid=0x33] GMSHealthMonitor server thread exiting [setup] START TEST org.apache.geode.internal.cache.wan.serial.WANHostNameVerificationDistributedTest.enableHostNameValidationAcrossAllComponents [vm0] [info 2020/12/25 05:33:26.065 GMT locator-0 <RMI TCP Connection(5)-172.17.0.10> tid=0x5a] Received method: org.apache.geode.test.dunit.rules.DistributedRestoreSystemProperties$$Lambda$19/39844442.run with 0 args on object:{code}
which implies that the thread doing the logging was started by a previous test.

When searching the logs for the ports used by the locators in the first logged message (44531 and 45273), we see that both locators that are mentioned stop before the test that logged the exception started: 
{code:java}
[vm2] [info 2020/12/25 05:33:25.705 GMT locator-2 <RMI TCP Connection(1)-172.17.0.10> tid=0x20] Distribution Locator on 7a2096ae9648[44531] is stopped 
{code}
 
{code:java}
[vm0] [info 2020/12/25 05:33:25.989 GMT locator-0 <RMI TCP Connection(5)-172.17.0.10> tid=0x5a] Distribution Locator on 7a2096ae9648[45273] is stopped {code}
but also see that a locator with the same port (45273) is started shortly before the test logged the exception: 
{code:java}
[vm0] [info 2020/12/25 05:33:27.188 GMT locator-0 <RMI TCP Connection(5)-172.17.0.10> tid=0x5a] DistributionManager 172.17.0.10(locator-0:245:locator)<ec><v0>:41001 started on 7a2096ae9648[45273]. There were 0 other DMs. others: [] (took 637 ms) (locator) {code}
which may be relevant.


was (Author: donalevans):
In the logs associated with the original failure, we see that an earlier call to {{TcpClient.requestToServer()}} in {{LocatorDiscovery.exchangeRemoteLocators()}} with a {{RemoteLocatorJoinRequest}} succeeds, as the following output is logged: 

{{}}
{code:java}

{code}
{{ [vm2] [info 2020/12/25 05:33:18.209 GMT locator-2 <WAN Locator Discovery Thread1> tid=0x52] Locator discovery task exchanged locator information 7a2096ae9648[44531] with localhost[45273]: \{1=[7a2096ae9648[45273]], 2=[7a2096ae9648[44531]]}.}}

 

Next, the code sleeps for {{WAN_LOCATOR_PING_INTERVAL}} ms (default value is 10 seconds), then attempts to ping the remote locator with a call to {{TcpClient.requestToServer()}} with a {{RemoteLocatorPingRequest}}. It is this second call that results in the {{ClassCastException}}, as can be seen by the time stamp on the fatal log ({{2020/12/25 05:33:28.218 GMT}}), which is almost exactly 10 seconds after the first log.

What is strange is that the test which logged the exception started after the first logged message from the WAN Locator Discovery Thread: 

{{}}
{code:java}
[vm3] [info 2020/12/25 05:33:26.063 GMT <Geode Failure Detection Server thread 1> tid=0x33] GMSHealthMonitor server thread exiting [setup] START TEST org.apache.geode.internal.cache.wan.serial.WANHostNameVerificationDistributedTest.enableHostNameValidationAcrossAllComponents [vm0] [info 2020/12/25 05:33:26.065 GMT locator-0 <RMI TCP Connection(5)-172.17.0.10> tid=0x5a] Received method: org.apache.geode.test.dunit.rules.DistributedRestoreSystemProperties$$Lambda$19/39844442.run with 0 args on object:{code}
{{}}

which implies that the thread doing the logging was started by a previous test.

When searching the logs for the ports used by the locators in the first logged message (44531 and 45273), we see that both locators that are mentioned stop before the test that logged the exception started: 

{{}}
{code:java}
[vm2] [info 2020/12/25 05:33:25.705 GMT locator-2 <RMI TCP Connection(1)-172.17.0.10> tid=0x20] Distribution Locator on 7a2096ae9648[44531] is stopped 
{code}
{{}}

{{}}

{{}}
{code:java}
[vm0] [info 2020/12/25 05:33:25.989 GMT locator-0 <RMI TCP Connection(5)-172.17.0.10> tid=0x5a] Distribution Locator on 7a2096ae9648[45273] is stopped {code}
{{}}

but also see that a locator with the same port (45273) is started shortly before the test logged the exception: 

{{}}
{code:java}
[vm0] [info 2020/12/25 05:33:27.188 GMT locator-0 <RMI TCP Connection(5)-172.17.0.10> tid=0x5a] DistributionManager 172.17.0.10(locator-0:245:locator)<ec><v0>:41001 started on 7a2096ae9648[45273]. There were 0 other DMs. others: [] (took 637 ms) (locator) {code}
{{}}

which may be relevant.

> ClassCastException in LocatorDiscovery.exchangeRemoteLocators
> -------------------------------------------------------------
>
>                 Key: GEODE-8890
>                 URL: https://issues.apache.org/jira/browse/GEODE-8890
>             Project: Geode
>          Issue Type: Bug
>          Components: wan
>    Affects Versions: 1.10.0, 1.11.0, 1.12.0, 1.13.1, 1.14.0
>            Reporter: Donal Evans
>            Assignee: Donal Evans
>            Priority: Major
>              Labels: blocks-1.14.0​, pull-request-available
>
> {noformat}
> org.apache.geode.internal.cache.wan.serial.WANHostNameVerificationDistributedTest > enableHostNameValidationAcrossAllComponents FAILED
>     java.lang.AssertionError: Suspicious strings were written to the log during this run.
>     Fix the strings or use IgnoredException.addIgnoredException to ignore.
>     -----------------------------------------------------------------------
>     Found suspect string in log4j at line 682
>     [fatal 2020/12/25 05:33:28.218 GMT <WAN Locator Discovery Thread1> tid=82] Uncaught exception in thread Thread[WAN Locator Discovery Thread1,5,RMI Runtime]
>     java.lang.ClassCastException: java.lang.Class cannot be cast to org.apache.geode.cache.client.internal.locator.wan.RemoteLocatorPingResponse
>       at org.apache.geode.cache.client.internal.locator.wan.LocatorDiscovery.exchangeRemoteLocators(LocatorDiscovery.java:191)
>       at org.apache.geode.cache.client.internal.locator.wan.LocatorDiscovery$RemoteLocatorDiscovery.run(LocatorDiscovery.java:121)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>       at java.lang.Thread.run(Thread.java:748)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)