Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/02/22 19:01:44 UTC

[jira] [Commented] (DRILL-5289) Drill should handle OOM due to insufficient heap type of errors more gracefully

    [ https://issues.apache.org/jira/browse/DRILL-5289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15878985#comment-15878985 ] 

Paul Rogers commented on DRILL-5289:
------------------------------------

In general, Java programs cannot gracefully handle heap exhaustion: any attempt to do work requires creating objects, which cannot be done because... well... the heap is exhausted.

A better solution is to manage heap resource usage: understand our usage, plan for it and inform the user of the heap needs. Since we don't understand our heap usage, we may well be creating large objects on the heap unnecessarily, or creating so many objects that the GC kicks in too frequently.

So, I'd reword this to not presuppose a solution. The solution is not to exhaust memory and then deal with it. The solution is to manage memory so that we don't exhaust heap in the first place.
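
To make "manage heap resource usage" a bit more concrete, here is a minimal sketch of checking heap headroom before starting a heap-hungry step. This is not Drill code: the class HeapBudget, its methods, and the reserve fraction are all hypothetical names invented for illustration. Only the JDK calls (MemoryMXBean.getHeapMemoryUsage() and Runtime.maxMemory()) are real. The "used" figure includes garbage not yet collected, so the check is conservative and advisory at best.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper for illustration only; not part of Drill.
final class HeapBudget {
    // Fraction of the max heap kept free as a safety margin (an assumed, caller-chosen value).
    private final double reserveFraction;

    HeapBudget(double reserveFraction) {
        if (reserveFraction < 0.0 || reserveFraction >= 1.0) {
            throw new IllegalArgumentException("reserveFraction must be in [0, 1)");
        }
        this.reserveFraction = reserveFraction;
    }

    // Pure arithmetic: bytes usable before eating into the reserve, never negative.
    static long headroom(long maxBytes, long usedBytes, double reserveFraction) {
        long reserve = (long) (maxBytes * reserveFraction);
        return Math.max(0L, maxBytes - usedBytes - reserve);
    }

    // Snapshot of the current heap headroom as reported by the JVM.
    long availableBytes() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long max = heap.getMax();
        if (max < 0) { // getMax() returns -1 when the maximum is undefined
            max = Runtime.getRuntime().maxMemory();
        }
        return headroom(max, heap.getUsed(), reserveFraction);
    }

    // True if a planned on-heap allocation of requestBytes fits within the headroom.
    boolean canAllocate(long requestBytes) {
        return requestBytes >= 0 && requestBytes <= availableBytes();
    }
}
```

A caller would ask canAllocate(estimatedBytes) before a large on-heap step and fail the query with a clear message (or take a cheaper path) when the answer is false, instead of running until the JVM throws OutOfMemoryError.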

> Drill should handle OOM due to insufficient heap type of errors more gracefully
> -------------------------------------------------------------------------------
>
>                 Key: DRILL-5289
>                 URL: https://issues.apache.org/jira/browse/DRILL-5289
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Flow, Execution - RPC
>    Affects Versions: 1.10.0
>            Reporter: Rahul Challapalli
>         Attachments: jstack.txt, partial_log.txt, Screen Shot 2017-02-22 at 10.58.39 AM (2).png
>
>
> [Git Commit ID will be updated soon]
> The query below, which uses the managed sort, causes an OOM error due to insufficient heap, which is a bug in itself.
> {code}
> ALTER SESSION SET `exec.sort.disable_managed` = false;
> +-------+-------------------------------------+
> |  ok   |               summary               |
> +-------+-------------------------------------+
> | true  | exec.sort.disable_managed updated.  |
> +-------+-------------------------------------+
> 1 row selected (1.096 seconds)
> 0: jdbc:drill:zk=10.10.100.183:5181> alter session set `planner.memory.max_query_memory_per_node` = 14106127360;
> +-------+----------------------------------------------------+
> |  ok   |                      summary                       |
> +-------+----------------------------------------------------+
> | true  | planner.memory.max_query_memory_per_node updated.  |
> +-------+----------------------------------------------------+
> 1 row selected (0.253 seconds)
> 0: jdbc:drill:zk=10.10.100.183:5181> alter session set `planner.width.max_per_node` = 1;
> +-------+--------------------------------------+
> |  ok   |               summary                |
> +-------+--------------------------------------+
> | true  | planner.width.max_per_node updated.  |
> +-------+--------------------------------------+
> 1 row selected (0.184 seconds)
> 0: jdbc:drill:zk=10.10.100.183:5181> select * from (select * from dfs.`/drill/testdata/resource-manager/250wide.tbl` order by columns[0])d where d.columns[0] = 'ljdfhwuehnoiueyf';
> {code}
> Once the OOM happens, chaos follows:
> {code}
> 1. Dangling fragments are left behind
> 2. Query fails but ZooKeeper thinks it's still running
> 3. Client connection times out
> 4. Profile page shows the same query as both running and failed.
> {code}
> We should be handling this situation more gracefully as this could be perceived as a drillbit stability issue. I attached the jstack. The logs and data set used are too big to upload here. Reach out to me if you need more information.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)