
Posted to user@ignite.apache.org by Mahesh Renduchintala <ma...@aline-consulting.com> on 2019/10/01 18:33:28 UTC

Re: GridCachePartitionExchangeManager Null pointer exception

Ilya,

Is there a workaround for this problem? I am reattaching fresh logs.
We were hit by this bug in a production environment, and it caused significant downtime.
I added a few more comments to the bug below. The workaround suggested there is not feasible.
https://issues.apache.org/jira/browse/IGNITE-10010

Thanks
mahesh

Re: GridCachePartitionExchangeManager Null pointer exception

Posted by Ilya Kasnacheev <il...@gmail.com>.

Hello!

In Java, assertions are a run-time property. You can enable them by passing
the -ea flag to the JVM. Note that we don't recommend running Ignite with
assertions on.
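
For anyone who wants to confirm how a given JVM was started, here is a
minimal sketch in plain Java (no Ignite involved). The class name
AssertionStatus is made up for illustration; the assert-with-assignment idiom
and Class.desiredAssertionStatus() are standard Java.

```java
final class AssertionStatus {
    /** Returns true only when this class runs with assertions enabled (-ea). */
    static boolean assertionsEnabled() {
        boolean enabled = false;
        // The assignment inside the assert executes only when assertions are on.
        assert enabled = true;
        return enabled;
    }

    public static void main(String[] args) {
        System.out.println("assertions enabled: " + assertionsEnabled());
        // The same setting as reported by the standard reflection API.
        System.out.println("desiredAssertionStatus: "
            + AssertionStatus.class.desiredAssertionStatus());
    }
}
```

Running it once with "java -ea" and once without should print different
values for the first line.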

Regards,
-- 
Ilya Kasnacheev


On Fri, Oct 11, 2019 at 11:52, maheshkr76private <ma...@gmail.com> wrote:

> Pavel, are Ignite 2.7.6 binaries built with assertions disabled? This could
> explain the null pointer exception seen here on the server side. I am still
> not clear on whether the null pointer exception I am reporting here is
> understood and whether a defect has been filed for it.

Re: GridCachePartitionExchangeManager Null pointer exception

Posted by maheshkr76private <ma...@gmail.com>.

Pavel, are Ignite 2.7.6 binaries built with assertions disabled? This could
explain the null pointer exception seen here on the server side. I am still
not clear on whether the null pointer exception I am reporting here is
understood and whether a defect has been filed for it.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: GridCachePartitionExchangeManager Null pointer exception

Posted by Pavel Kovalenko <jo...@gmail.com>.

Mahesh,

An assertion error occurs if you run the node with assertions enabled (JVM
flag -ea). If assertions are disabled, the same condition leads to a
NullPointerException, which is what you have in your logs.
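
To illustrate the point, here is a minimal sketch in plain Java. It is not
Ignite code: the AFFINITY map and the partitions() method are hypothetical
stand-ins for "look up affinity for a cache that may already be destroyed".
The same missing entry surfaces as an AssertionError under -ea and as a
NullPointerException otherwise.

```java
import java.util.HashMap;
import java.util.Map;

final class AssertVsNpe {
    // Hypothetical stand-in for a per-cache affinity registry.
    static final Map<String, int[]> AFFINITY = new HashMap<>();

    static int partitions(String cacheName) {
        int[] aff = AFFINITY.get(cacheName); // null if the cache is gone
        // With -ea this line throws AssertionError for a missing cache.
        assert aff != null : "No affinity for cache: " + cacheName;
        // Without -ea the assert is skipped and the dereference throws NPE.
        return aff.length;
    }

    /** Returns "ok:<n>" on success, otherwise the simple name of the failure. */
    static String probe(String cacheName) {
        try {
            return "ok:" + partitions(cacheName);
        } catch (AssertionError | NullPointerException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        AFFINITY.put("live", new int[1024]);
        System.out.println(probe("live"));
        System.out.println(probe("destroyed"));
    }
}
```

The second line of output depends only on whether the JVM was started with
-ea, which matches the difference between the ticket text and the logs.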

On Sat, Oct 5, 2019 at 16:47, Mahesh Renduchintala <
mahesh.renduchintala@aline-consulting.com> wrote:

> Pavel, I don't have the logs for the client node. This has happened twice in
> our cluster in 45 days so far, and it is difficult to reproduce.
> But the logs show a null pointer exception on the server nodes: first one
> server node (192.168.1.6) went down, and then the other.
>
> IGNITE-12255 notes that an assertion could be seen on the coordinator,
> but what I see is a null pointer exception.
> I agree that the race condition described in IGNITE-12255 looks similar to
> the logs I attached, but it does not explain the null pointer exception.
>
> The race is the following:
>
> A client node (with some configured caches) joins a cluster, sending a
> SingleMessage to the coordinator during client PME. This SingleMessage
> contains affinity fetch requests for all cluster caches. While the
> SingleMessage is in flight, server nodes finish client PME and also process
> and finish cache destroy PME. When a cache is destroyed, affinity for that
> cache is cleared. When the SingleMessage is delivered to the coordinator, it
> doesn't have affinity for a requested cache because the cache is already
> destroyed. *It leads to assertion error on the coordinator* and
> unpredictable behavior on the client node.
>
>
>

Re: GridCachePartitionExchangeManager Null pointer exception

Posted by Mahesh Renduchintala <ma...@aline-consulting.com>.

Pavel, I don't have the logs for the client node. This has happened twice in our cluster in 45 days so far, and it is difficult to reproduce.
But the logs show a null pointer exception on the server nodes: first one server node (192.168.1.6) went down, and then the other.

IGNITE-12255 notes that an assertion could be seen on the coordinator, but what I see is a null pointer exception.
I agree that the race condition described in IGNITE-12255 looks similar to the logs I attached, but it does not explain the null pointer exception.


The race is the following:

A client node (with some configured caches) joins a cluster, sending a SingleMessage to the coordinator during client PME (Partition Map Exchange). This SingleMessage contains affinity fetch requests for all cluster caches. While the SingleMessage is in flight, server nodes finish client PME and also process and finish cache destroy PME. When a cache is destroyed, affinity for that cache is cleared. When the SingleMessage is delivered to the coordinator, it doesn’t have affinity for a requested cache because the cache is already destroyed. It leads to assertion error on the coordinator and unpredictable behavior on the client node.
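
The ordering described above can be replayed deterministically with a toy
model. This is only a sketch in plain Java: AffinityRaceSketch and all of its
methods are hypothetical names, it is not Ignite code, and defensiveHandle()
is not claimed to be how the actual fix works. It only shows why a handler
that assumes affinity is always present fails once a destroy completes while
the request is in flight.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class AffinityRaceSketch {
    /** Toy coordinator state: cache name -> partition count (hypothetical). */
    private final Map<String, Integer> affinity = new ConcurrentHashMap<>();

    void createCache(String name, int parts) { affinity.put(name, parts); }

    /** Destroying a cache clears its affinity, as in the description above. */
    void destroyCache(String name) { affinity.remove(name); }

    /** Snapshot taken by the joining client: one fetch request per known cache. */
    List<String> buildFetchRequests() { return new ArrayList<>(affinity.keySet()); }

    /** Naive handler: assumes affinity still exists for every requested cache. */
    int naiveHandle(List<String> requests) {
        int total = 0;
        for (String cache : requests)
            total += affinity.get(cache); // unboxing null throws NullPointerException
        return total;
    }

    /** Defensive handler: skips caches destroyed while the message was in flight. */
    int defensiveHandle(List<String> requests) {
        int total = 0;
        for (String cache : requests) {
            Integer parts = affinity.get(cache);
            if (parts != null)
                total += parts;
        }
        return total;
    }

    public static void main(String[] args) {
        AffinityRaceSketch crd = new AffinityRaceSketch();
        crd.createCache("a", 32);
        crd.createCache("b", 64);

        List<String> inFlight = crd.buildFetchRequests(); // client sends its message
        crd.destroyCache("b");                            // destroy finishes first

        System.out.println("defensive total: " + crd.defensiveHandle(inFlight));
        try {
            crd.naiveHandle(inFlight);
        } catch (NullPointerException e) {
            System.out.println("naive handler failed with NullPointerException");
        }
    }
}
```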

Re: GridCachePartitionExchangeManager Null pointer exception

Posted by Pavel Kovalenko <jo...@gmail.com>.

Mahesh,

Do you have logs from the following thick client?
TcpDiscoveryNode [id=5204d16d-e6fc-4cc3-a1d9-17edf59f961e,
addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.1.171],
sockAddrs=[/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, /192.168.1.171:0],
discPort=0, order=1146, intOrder=579, lastExchangeTime=1569947734191,
loc=false, ver=2.7.6#20190911-sha1:21f7ca41, *isClient=true*]
I need to check them; maybe I'm missing something.

On Fri, Oct 4, 2019 at 05:08, Mahesh Renduchintala <
mahesh.renduchintala@aline-consulting.com> wrote:

> Hello Pavel,
>
> OK. I am not quite clear on the workaround you suggested in your
> previous comment:
> >>>>As a workaround, I can suggest to not explicitly declare caches in the
> client configuration. During joining to cluster process, the client node
> will receive all configured caches from server nodes.
>
> In my scenario,
> a) there are absolutely no caches declared on my thick client side.
> b) The cache templates are declared on the server nodes, and the caches are
> created via SQL issued from the thick client side.
>
> How do I implement the workaround you suggested?
>
> regards
> Mahesh
>
>

Re: GridCachePartitionExchangeManager Null pointer exception

Posted by Mahesh Renduchintala <ma...@aline-consulting.com>.

Hello Pavel,

OK. I am not quite clear on the workaround you suggested in your previous comment:
>>>>As a workaround, I can suggest to not explicitly declare caches in the client configuration. During joining to cluster process, the client node will receive all configured caches from server nodes.

In my scenario,
a) there are absolutely no caches declared on my thick client side.
b) The cache templates are declared on the server nodes, and the caches are created via SQL issued from the thick client side.

How do I implement the workaround you suggested?

regards
Mahesh

Re: GridCachePartitionExchangeManager Null pointer exception

Posted by maheshkr76private <ma...@gmail.com>.

Hello Pavel, 

OK. The part I am not quite clear on is your previous comment,
quoted below:

>>>>As a workaround, I can suggest to not explicitly declare caches in the
client configuration. During joining to cluster process, the client node
will receive all configured caches from server nodes.

In my scenario, there are absolutely no caches declared on my thick client
side. 
How do I implement this workaround?

regards
Mahesh




Re: GridCachePartitionExchangeManager Null pointer exception

Posted by Pavel Kovalenko <jo...@gmail.com>.

Mahesh,

Judging by your logs and the exception I see, the issue you mentioned is
not related to your problem.
The problem that is similar to IGNITE-10010 is
https://issues.apache.org/jira/browse/IGNITE-9562

You have a thick client joining the server topology:
[16:35:34,948][INFO][disco-event-worker-#50][GridDiscoveryManager] Added
new node to topology: TcpDiscoveryNode
[id=5204d16d-e6fc-4cc3-a1d9-17edf59f961e, addrs=[0:0:0:0:0:0:0:1%lo,
127.0.0.1, 192.168.1.171], sockAddrs=[/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, /
192.168.1.171:0], discPort=0, order=1146, intOrder=579,
lastExchangeTime=1569947734191, loc=false,
ver=2.7.6#20190911-sha1:21f7ca41, *isClient=true*]
This causes a Partition Map Exchange on version *[1146, 0]*:
[16:35:34,949][INFO][exchange-worker-#51][time] Started exchange init
[topVer=AffinityTopologyVersion *[topVer=1146, minorTopVer=0]*,
mvccCrd=MvccCoordinator [nodeId=84de670f-49e6-4dd8-9d14-4855fdd5acdf,
crdVer=1569681573983, topVer=AffinityTopologyVersion [topVer=2,
minorTopVer=0]], mvccCrdChange=false, crd=false, evt=NODE_JOINED,
evtNode=5204d16d-e6fc-4cc3-a1d9-17edf59f961e, customEvt=null,
allowMerge=true]
Right after that, you have 2 cache destroy events.
Then the server node goes down while processing a single message from the
thick client on version *[1146, 0]*:
[16:36:08,567][SEVERE][sys-#37524][GridCacheIoManager] Failed processing
message [senderId=5204d16d-e6fc-4cc3-a1d9-17edf59f961e,
msg=GridDhtPartitionsSingleMessage [parts=null, partCntrs=null,
partsSizes=null, partHistCntrs=null, err=null, client=true, finishMsg=null,
activeQryTrackers=null, super=GridDhtPartitionsAbstractMessage
[exchId=GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion
*[topVer=1146,
minorTopVer=0]*, discoEvt=null, nodeId=5204d16d, evt=NODE_JOINED],
lastVer=GridCacheVersion [topVer=181162717, order=1569940014325,
nodeOrder=1144], super=GridCacheMessage [msgId=7894, depInfo=null,
err=null, skipPrepare=false]]]]
java.lang.NullPointerException
This is exactly the problem described in the ticket I mentioned in my
previous message.


On Thu, Oct 3, 2019 at 15:04, Mahesh Renduchintala <
mahesh.renduchintala@aline-consulting.com> wrote:

> Pavel, thanks for your analysis. The two logs that I attached are those
> of two server data nodes (neither is configured in thick client mode).
> The logs did show a server data node losing its connection and trying to
> connect back to the other node (192.168.1.6)...
>
> On second thought, the issue below still makes sense.
> https://issues.apache.org/jira/browse/IGNITE-10010
>
> Please check.
>
>

Re: GridCachePartitionExchangeManager Null pointer exception

Posted by Mahesh Renduchintala <ma...@aline-consulting.com>.

Pavel, thanks for your analysis. The two logs that I attached are those of two server data nodes (neither is configured in thick client mode).
The logs did show a server data node losing its connection and trying to connect back to the other node (192.168.1.6)...

On second thought, the issue below still makes sense.
https://issues.apache.org/jira/browse/IGNITE-10010

Please check.

Re: GridCachePartitionExchangeManager Null pointer exception

Posted by Pavel Kovalenko <jo...@gmail.com>.

Hi Mahesh,

Your problem is described here:
https://issues.apache.org/jira/browse/IGNITE-12255
The section starts with "This solution showed the existing race between
client node join and concurrent cache destroy."
According to your logs, I see a client node join running concurrently with
the stopping of caches "SQL_PUBLIC_INCOME_DATASET_MALLIKARJUNA" and
"income_dataset_Mallikarjuna".
I think some of them are configured on the client node explicitly.

This problem is already fixed in an open-source fork of Ignite, and the fix
will be donated to Ignite soon.
As a workaround, I can suggest not declaring caches explicitly in the
client configuration. While joining the cluster, the client node will
receive all configured caches from the server nodes.


On Wed, Oct 2, 2019 at 12:17, Mahesh Renduchintala <
mahesh.renduchintala@aline-consulting.com> wrote:

> This seems to be a new bug, unrelated to IGNITE-10010.
> Both nodes were fully operational when the null pointer exception
> happened.
> The logs show that, and that both nodes crashed.
>
> Can you give some insight into this and into possible scenarios that could
> have led to it?
> Is there any potential workaround?
>
>

Re: GridCachePartitionExchangeManager Null pointer exception

Posted by Mahesh Renduchintala <ma...@aline-consulting.com>.

This seems to be a new bug, unrelated to IGNITE-10010.
Both nodes were fully operational when the null pointer exception happened.
The logs show that, and that both nodes crashed.

Can you give some insight into this and into possible scenarios that could have led to it?
Is there any potential workaround?