Posted to issues@ignite.apache.org by "Denis Magda (JIRA)" <ji...@apache.org> on 2015/08/12 16:01:46 UTC

[jira] [Comment Edited] (IGNITE-1239) Cache partition iterator throws exception when concurrent rebalancing is running

    [ https://issues.apache.org/jira/browse/IGNITE-1239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693515#comment-14693515 ] 

Denis Magda edited comment on IGNITE-1239 at 8/12/15 2:01 PM:
--------------------------------------------------------------

Created a test to reproduce the bug.
The test initially starts 3 nodes. Once all 3 nodes are ready, the test starts executing per-partition queries from one thread while starting and stopping additional nodes from several other threads.

Could reproduce the "Partition can't be reserved" error only intermittently.

But the test exposed one more bug: it consistently hung on partition exchange. Spent almost the whole day debugging the issue.
The hang was caused by a scan iterator that wasn't closed for the local partitions of the node that initially executed the query. This affected rebalancing when a node joined or left the topology.
Fixed this issue in {{GridCacheQueryManager}}; the test doesn't hang any more.
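
For illustration, here is a minimal sketch of why an unclosed scan iterator can block rebalancing. It is not the code from the patch: {{Partition}}, {{ScanIterator}} and {{scan}} are hypothetical stand-ins, and it assumes the iterator holds a partition reservation that is released only on close.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch only: all names here are hypothetical and are not Ignite classes. */
public class ScanIteratorCloseSketch {
    /** Stand-in for a partition that cannot be moved while it is reserved. */
    static final class Partition {
        private final AtomicInteger reservations = new AtomicInteger();

        void reserve() { reservations.incrementAndGet(); }
        void release() { reservations.decrementAndGet(); }
        boolean reserved() { return reservations.get() > 0; }
    }

    /** Scan iterator that holds a reservation from construction until close(). */
    static final class ScanIterator implements Iterator<String>, AutoCloseable {
        private final Partition part;
        private final Iterator<String> delegate;
        private boolean closed;

        ScanIterator(Partition part, List<String> entries) {
            this.part = part;
            this.delegate = entries.iterator();
            part.reserve();
        }

        @Override public boolean hasNext() { return !closed && delegate.hasNext(); }
        @Override public String next() { return delegate.next(); }

        /** Idempotent: releases the reservation exactly once. */
        @Override public void close() {
            if (!closed) {
                closed = true;
                part.release();
            }
        }
    }

    /** Scans the entries and always closes the iterator, even if iteration fails. */
    static int scan(Partition part, List<String> entries) {
        int cnt = 0;
        try (ScanIterator it = new ScanIterator(part, entries)) {
            while (it.hasNext()) {
                it.next();
                cnt++;
            }
        }
        return cnt;
    }

    public static void main(String[] args) {
        Partition part = new Partition();
        int cnt = scan(part, Arrays.asList("a", "b", "c"));
        System.out.println("entries=" + cnt + ", reserved=" + part.reserved());
    }
}
```

If the iterator were created without try-with-resources and never closed, {{reserved()}} would stay true in this model, which is the kind of leftover state that could stall an exchange.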

The "Partition can't be reserved" error is no longer reproduced either. However, will try to improve the test to double-check that this issue has been fixed as well.

The fix for the issue with local scan iterators can be reviewed in the attached patch.


was (Author: dmagda):
Created a test to reproduce the bug.
The test initially starts 3 nodes. Once all 3 nodes are ready, the test starts executing per-partition queries from one thread while starting and stopping additional nodes from several other threads.

Could reproduce the "Partition can't be reserved" error only intermittently.

But the test exposed one more bug: it consistently hung on partition exchange. Spent almost the whole day debugging the issue.
The hang was caused by a scan iterator that wasn't closed for the local partitions of the node that initially executed the query. This affected rebalancing when a node joined or left the topology.
Fixed this issue in {{GridCacheQueryManager}}; the test doesn't hang any more.

The "Partition can't be reserved" error is no longer reproduced. However, will try to improve the test to double-check that this issue has been fixed as well.

The fix for the issue with local scan iterators can be reviewed in the attached patch.

> Cache partition iterator throws exception when concurrent rebalancing is running
> --------------------------------------------------------------------------------
>
>                 Key: IGNITE-1239
>                 URL: https://issues.apache.org/jira/browse/IGNITE-1239
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>            Reporter: Alexey Goncharuk
>            Assignee: Denis Magda
>
> I observed this exception when IgniteRDD was iterating over a partition and two new nodes joined:
> {code}
> Caused by: class org.apache.ignite.IgniteCheckedException: Query execution failed: GridCacheQueryBean [qry=GridCacheQueryAdapter [type=SCAN, clsName=null, clause=null, filter=org.apache.ignite.internal.processors.cache.IgniteCacheProxy$1@6490c94c, part=138, incMeta=false, metrics=GridCacheQueryMetricsAdapter [minTime=10, maxTime=10, avgTime=10.0, execs=1, fails=1, executed=true], pageSize=1024, timeout=0, keepAll=true, incBackups=false, dedup=false, prj=null, keepPortable=false, subjId=9cdc9751-c6ec-43eb-968a-e941f2a1a8cd, taskHash=0], rdc=null, trans=null]
> 	at org.apache.ignite.internal.processors.cache.query.GridCacheQueryFutureAdapter.checkError(GridCacheQueryFutureAdapter.java:245)
> 	at org.apache.ignite.internal.processors.cache.query.GridCacheQueryFutureAdapter.internalIterator(GridCacheQueryFutureAdapter.java:303)
> 	at org.apache.ignite.internal.processors.cache.query.GridCacheQueryFutureAdapter.next(GridCacheQueryFutureAdapter.java:156)
> 	... 17 more
> Caused by: class org.apache.ignite.IgniteCheckedException: Failed to execute query on node [query=GridCacheQueryBean [qry=GridCacheQueryAdapter [type=SCAN, clsName=null, clause=null, filter=org.apache.ignite.internal.processors.cache.IgniteCacheProxy$1@6490c94c, part=138, incMeta=false, metrics=GridCacheQueryMetricsAdapter [minTime=0, maxTime=0, avgTime=0.0, execs=0, fails=0, executed=false], pageSize=1024, timeout=0, keepAll=true, incBackups=false, dedup=false, prj=null, keepPortable=false, subjId=9cdc9751-c6ec-43eb-968a-e941f2a1a8cd, taskHash=0], rdc=null, trans=null], nodeId=963d0e35-7805-4b6d-8d64-22cce84e35f2]
> 	at org.apache.ignite.internal.processors.cache.query.GridCacheQueryFutureAdapter.onPage(GridCacheQueryFutureAdapter.java:370)
> 	at org.apache.ignite.internal.processors.cache.query.GridCacheDistributedQueryManager.processQueryResponse(GridCacheDistributedQueryManager.java:377)
> 	at org.apache.ignite.internal.processors.cache.query.GridCacheDistributedQueryManager.access$000(GridCacheDistributedQueryManager.java:44)
> 	at org.apache.ignite.internal.processors.cache.query.GridCacheDistributedQueryManager$1.apply(GridCacheDistributedQueryManager.java:74)
> 	at org.apache.ignite.internal.processors.cache.query.GridCacheDistributedQueryManager$1.apply(GridCacheDistributedQueryManager.java:72)
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:534)
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:240)
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:48)
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1026)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2256)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:946)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.access$1700(GridIoManager.java:60)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager$6.run(GridIoManager.java:915)
> 	... 3 more
> Caused by: class org.apache.ignite.IgniteCheckedException: Partition can't be reserved
> 	at org.apache.ignite.internal.util.IgniteUtils.cast(IgniteUtils.java:6808)
> {code}
> The issue is that the query request was sent to a backup node, and by the time the request arrived the partition had already been evicted, which resulted in the "Partition can't be reserved" exception. We should automatically retry if this exception is encountered.
> I believe we have logic that retries, but it looks like there is a bug in that logic.
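
The automatic retry suggested in the description could look roughly like the sketch below. This is not Ignite's actual retry logic: {{PartitionNotReservedException}} and {{executeWithRetry}} are hypothetical names, and the sketch assumes the failure can be recognized by exception type and that a bounded number of attempts is acceptable.

```java
import java.util.concurrent.Callable;

/** Sketch only: hypothetical names, not Ignite's actual retry logic. */
public class PartitionRetrySketch {
    /** Stand-in for the "Partition can't be reserved" failure. */
    static final class PartitionNotReservedException extends Exception {
        PartitionNotReservedException(String msg) { super(msg); }
    }

    /**
     * Runs the query, retrying only when the partition could not be reserved.
     * Any other exception propagates immediately without a retry.
     */
    static <T> T executeWithRetry(Callable<T> qry, int maxAttempts) throws Exception {
        if (maxAttempts < 1)
            throw new IllegalArgumentException("maxAttempts must be >= 1");

        PartitionNotReservedException last = null;

        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return qry.call();
            }
            catch (PartitionNotReservedException e) {
                // The partition was evicted or moved; a real client would
                // re-resolve the partition owner before the next attempt.
                last = e;
            }
        }

        // All attempts failed with the same kind of error: report the last one.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};

        String res = executeWithRetry(() -> {
            if (++calls[0] < 3)
                throw new PartitionNotReservedException("Partition can't be reserved");
            return "page";
        }, 5);

        System.out.println(res + " after " + calls[0] + " attempts");
    }
}
```

Bounding the attempts keeps a permanently unavailable partition from turning into an endless loop; the last failure is rethrown so the caller still sees the original cause.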



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)