Posted to issues@ignite.apache.org by "Taras Ledkov (JIRA)" <ji...@apache.org> on 2016/12/07 14:07:58 UTC
[jira] [Comment Edited] (IGNITE-3558) Affinity task hangs when
Collision SPI produces a lot of job rejections & Failover SPI produces many
attempts
[ https://issues.apache.org/jira/browse/IGNITE-3558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15722540#comment-15722540 ]
Taras Ledkov edited comment on IGNITE-3558 at 12/7/16 2:07 PM:
---------------------------------------------------------------
Pull request to run tests: [pull/1326|https://github.com/apache/ignite/pull/1326]
was (Author: tledkov-gridgain):
Pull request to run tests: [pull/1316|https://github.com/apache/ignite/pull/1316]
> Affinity task hangs when Collision SPI produces a lot of job rejections & Failover SPI produces many attempts
> -------------------------------------------------------------------------------------------------------------
>
> Key: IGNITE-3558
> URL: https://issues.apache.org/jira/browse/IGNITE-3558
> Project: Ignite
> Issue Type: Bug
> Components: compute
> Reporter: Taras Ledkov
> Assignee: Taras Ledkov
> Fix For: 2.0
>
> Time Spent: 3h
> Remaining Estimate: 0h
>
> The test to reproduce:
> IgniteCacheLockPartitionOnAffinityRunWithCollisionSpiTest#testJobFinishing
> *Root cause*
> GridJobExecuteResponse isn't sent from the target node because GridJobWorker instances get mixed up in the CollisionContext.
> *Suggestion*
> The method GridJobProcessor.CollisionJobContext.cancel()
> uses passiveJobs.remove(jobWorker.getJobId(), jobWorker).
> *passiveJobs* is a ConcurrentHashMap, and GridJobWorker.equals() is implemented as a comparison of jobId.
> So, when two threads try to cancel two workers with *the same jobId*, we have the following case:
> - thread0: removes jobWorker0 and cancels jobWorker0;
> - thread0: puts jobWorker1 (because jobWorker0 has already been removed);
> - thread1: still holds jobWorker0 and tries to cancel it;
> - thread1: removes jobWorker1 instead of jobWorker0 (because only jobId is used to identify the worker);
> - thread1: doesn't send ExecuteResponse because jobWorker0 has already been canceled.
> *Proposal*
> Try using the system default equals() (reference identity) for GridJobWorker.
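
The interleaving described above can be replayed sequentially in a minimal sketch. The classes below are hypothetical stand-ins, not the actual Ignite code: IdEqualsWorker mimics a worker whose equals() compares only jobId, IdentityWorker keeps Object's default equals(), and the map is a plain ConcurrentHashMap named passiveJobs after the field in the description. The only library behavior relied on is that ConcurrentHashMap.remove(key, value) compares the mapped value with equals().

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class PassiveJobsSketch {
    /** Hypothetical stand-in for a worker whose equals() compares only jobId. */
    static final class IdEqualsWorker {
        final String jobId;

        IdEqualsWorker(String jobId) { this.jobId = jobId; }

        @Override public boolean equals(Object o) {
            return o instanceof IdEqualsWorker && jobId.equals(((IdEqualsWorker) o).jobId);
        }

        @Override public int hashCode() { return jobId.hashCode(); }
    }

    /** Hypothetical stand-in that keeps Object's identity-based equals(). */
    static final class IdentityWorker {
        final String jobId;

        IdentityWorker(String jobId) { this.jobId = jobId; }
    }

    /**
     * Replays the described steps in a single thread.
     * Returns true if the stale remove (by the holder of worker0) evicted worker1.
     */
    static <W> boolean staleRemoveEvictsNewWorker(W worker0, W worker1, String jobId) {
        ConcurrentMap<String, W> passiveJobs = new ConcurrentHashMap<>();
        passiveJobs.put(jobId, worker0);

        // thread0: removes (and would cancel) worker0.
        passiveJobs.remove(jobId, worker0);
        // thread0: puts worker1 under the same jobId.
        passiveJobs.put(jobId, worker1);
        // thread1: still holds worker0 and tries to remove it.
        return passiveJobs.remove(jobId, worker0);
    }

    public static void main(String[] args) {
        // equals() by jobId: the stale remove matches worker1 and evicts it.
        System.out.println(staleRemoveEvictsNewWorker(
            new IdEqualsWorker("job-1"), new IdEqualsWorker("job-1"), "job-1")); // prints "true"

        // Default equals(): the stale remove is a no-op and worker1 stays in the map.
        System.out.println(staleRemoveEvictsNewWorker(
            new IdentityWorker("job-1"), new IdentityWorker("job-1"), "job-1")); // prints "false"
    }
}
```

With identity-based equals(), the second remove fails for the stale instance, so the caller can tell that its worker is no longer the one registered under that jobId.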
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)