Posted to issues@ignite.apache.org by "Sergey Grimstad (JIRA)" <ji...@apache.org> on 2018/08/07 13:28:00 UTC

[jira] [Assigned] (IGNITE-9141) SQL: Trace and test query mapping problems

     [ https://issues.apache.org/jira/browse/IGNITE-9141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sergey Grimstad reassigned IGNITE-9141:
---------------------------------------

    Assignee: Sergey Grimstad

> SQL: Trace and test query mapping problems
> ------------------------------------------
>
>                 Key: IGNITE-9141
>                 URL: https://issues.apache.org/jira/browse/IGNITE-9141
>             Project: Ignite
>          Issue Type: Task
>          Components: sql
>    Affects Versions: 2.6
>            Reporter: Vladimir Ozerov
>            Assignee: Sergey Grimstad
>            Priority: Major
>             Fix For: 2.7
>
>
> One of the mandatory steps of SQL query execution is topology mapping: we need to select the nodes where the required caches are located and make sure that their partition distribution is valid for the given SQL query. Once the nodes are selected, we try to reserve the partitions of interest on the mapper nodes to make sure that they will not be evicted during query execution.
> However, the mapping step may fail for many reasons, most often rebalancing or concurrent node failures. In this case we simply retry the whole query execution from scratch. In IGNITE-9114 we ensured that the retry cycle is not infinite and that the root cause of a remap is logged. However, the original root cause of the remap is not propagated to the client node, which makes the problem hard to debug for end users. We also do not have enough tests for remap events. Let's fix this.
> Proposed implementation flow:
> 1) Add a {{retryCause: String}} field to {{GridQueryNextPageResponse}}, which should be populated along with the {{retry}} field on the mapper node. See the {{GridMapQueryExecutor#sendRetry}} method to understand what may cause retries (failed to reserve partitions or failed to execute a non-collocated join). Make sure that these error messages are as verbose as possible, with all necessary details (root cause, cache names, affected partitions, etc.).
> 2) Make sure that the root cause is set in {{ReduceQueryRun#state}} and then propagated to the user exception in case of a retry timeout.
> 3) Evaluate all places inside {{org.apache.ignite.internal.processors.query.h2.twostep.GridReduceQueryExecutor#query}} that may lead to a retry, and make sure that the root cause is verbose and propagated to the user exception in case of a retry timeout.
> 4) Add tests covering all retry branches and ensure that the query fails after the timeout and that the error message is correct.
> *NB*: Once propagation of the error message to the reducer is implemented, we may remove the additional logging altogether.
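
The flow proposed above could be sketched roughly as follows. This is only an illustration of the idea (a mapper response carries a retry flag plus a textual cause, and the reducer puts the last cause into the exception thrown on retry timeout). All class and method names here (RetryCauseSketch, MapResponse, QueryRetryTimeoutException, runWithRetries) are hypothetical stand-ins and are not Ignite classes; the real change would touch GridQueryNextPageResponse, GridMapQueryExecutor and GridReduceQueryExecutor, whose internals are not shown in this issue.

```java
import java.util.function.Supplier;

/** Hypothetical sketch only; none of these types exist in Ignite. */
final class RetryCauseSketch {
    /** Stand-in for a mapper response that carries a retry flag and its cause. */
    static final class MapResponse {
        final boolean retry;
        final String retryCause;

        MapResponse(boolean retry, String retryCause) {
            this.retry = retry;
            this.retryCause = retryCause;
        }
    }

    /** Stand-in for the exception that is finally surfaced to the user. */
    static final class QueryRetryTimeoutException extends RuntimeException {
        QueryRetryTimeoutException(String msg) {
            super(msg);
        }
    }

    /**
     * Calls the mapper until it stops asking for a retry or the timeout expires.
     * The most recent retry cause is remembered and included in the exception.
     * A real implementation would back off between attempts instead of spinning.
     */
    static String runWithRetries(Supplier<MapResponse> mapper, long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        String lastCause = null;

        do {
            MapResponse res = mapper.get();

            if (!res.retry)
                return "OK";

            lastCause = res.retryCause;
        } while (System.nanoTime() < deadline);

        throw new QueryRetryTimeoutException(
            "Failed to map SQL query to topology within " + timeoutMs +
            " ms, last retry cause: " + lastCause);
    }

    public static void main(String[] args) {
        // Simulated mapper that always fails to reserve a partition.
        Supplier<MapResponse> failing = () -> new MapResponse(true,
            "Failed to reserve partitions [cache=person, part=42]");

        try {
            runWithRetries(failing, 20);
        }
        catch (QueryRetryTimeoutException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

A test for item 4 would then drive each retry branch with a simulated failure like the one in main and assert both that the query fails after the timeout and that the message contains the expected cause.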



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)