You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@drill.apache.org by "Rahul Challapalli (JIRA)" <ji...@apache.org> on 2017/03/28 22:56:41 UTC
[jira] [Commented] (DRILL-4272) When sort runs out of memory and query fails, resources are seemingly not freed

    [ https://issues.apache.org/jira/browse/DRILL-4272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15946128#comment-15946128 ] 

Rahul Challapalli commented on DRILL-4272:
------------------------------------------

[~Paul.Rogers] I do not see any PR associated with this jira. I am wondering why it is marked as closed. Were you unable to reproduce this issue with the existing sort based on the criteria mentioned by Vicky?

> When sort runs out of memory and query fails, resources are seemingly not freed
> -------------------------------------------------------------------------------
>
>                 Key: DRILL-4272
>                 URL: https://issues.apache.org/jira/browse/DRILL-4272
>             Project: Apache Drill
>          Issue Type: Sub-task
>          Components: Execution - Relational Operators
>    Affects Versions: 1.5.0
>            Reporter: Victoria Markman
>            Assignee: Paul Rogers
>            Priority: Critical
>             Fix For: 1.10.0
>
>
> Executed query11.sql from resources/Advanced/tpcds/tpcds_sf1/original/parquet
> Query runs out of memory:
> {code}
> Error: RESOURCE ERROR: One or more nodes ran out of memory while executing the query.
> Unable to allocate sv2 for 32768 records, and not enough batchGroups to spill.
> batchGroups.size 1
> spilledBatchGroups.size 0
> allocated memory 19961472
> allocator limit 20000000
> Fragment 19:0
> [Error Id: 87aa32b8-17eb-488e-90cb-5f5bffff9aec on atsqa4-133.qa.lab:31010] (state=,code=0)
> {code}
> And leaves fragments running, holding resources:
> {code}
> 2016-01-14 22:46:32,435 [Drillbit-ShutdownHook#0] INFO  o.apache.drill.exec.server.Drillbit - Received shutdown request.
> 2016-01-14 22:46:32,546 [Curator-ServiceCache-0] WARN  o.a.d.e.w.fragment.FragmentExecutor - Foreman atsqa4-136.qa.lab no longer active.  Cancelling fragment 2967db08-cd38-925a-4960-9e881f537af8:19:0.
> 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] INFO  o.a.d.e.w.fragment.FragmentExecutor - 2967db08-cd38-925a-4960-9e881f537af8:19:0: State change requested CANCELLATION_REQUESTED --> CANCELLATION_REQUESTED
> 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] WARN  o.a.d.e.w.fragment.FragmentExecutor - 2967db08-cd38-925a-4960-9e881f537af8:19:0: Ignoring unexpected state transition CANCELLATION_REQUESTED --> CANCELLATION_REQUESTED
> 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] WARN  o.a.d.e.w.fragment.FragmentExecutor - Foreman atsqa4-136.qa.lab no longer active.  Cancelling fragment 2967db08-cd38-925a-4960-9e881f537af8:17:0.
> 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] INFO  o.a.d.e.w.fragment.FragmentExecutor - 2967db08-cd38-925a-4960-9e881f537af8:17:0: State change requested CANCELLATION_REQUESTED --> CANCELLATION_REQUESTED
> 2016-01-14 22:46:32,547 [Curator-ServiceCache-0] WARN  o.a.d.e.w.fragment.FragmentExecutor - 2967db08-cd38-925a-4960-9e881f537af8:17:0: Ignoring unexpected state transition CANCELLATION_REQUESTED --> CANCELLATION_REQUESTED
> 2016-01-14 22:46:33,563 [BitServer-1] INFO  o.a.d.exec.rpc.control.ControlClient - Channel closed /10.10.88.134:59069 <--> atsqa4-136.qa.lab/10.10.88.136:31011.
> 2016-01-14 22:46:33,563 [BitClient-1] INFO  o.a.drill.exec.rpc.data.DataClient - Channel closed /10.10.88.134:34802 <--> atsqa4-136.qa.lab/10.10.88.136:31012.
> 2016-01-14 22:46:33,590 [BitClient-1] INFO  o.a.drill.exec.rpc.data.DataClient - Channel closed /10.10.88.134:36937 <--> atsqa4-135.qa.lab/10.10.88.135:31012.
> 2016-01-14 22:46:33,595 [BitClient-1] INFO  o.a.drill.exec.rpc.data.DataClient - Channel closed /10.10.88.134:53860 <--> atsqa4-133.qa.lab/10.10.88.133:31012.
> 2016-01-14 22:46:38,467 [BitClient-1] INFO  o.a.drill.exec.rpc.data.DataClient - Channel closed /10.10.88.134:48276 <--> atsqa4-134.qa.lab/10.10.88.134:31012.
> 2016-01-14 22:46:39,470 [pool-6-thread-1] INFO  o.a.drill.exec.rpc.user.UserServer - closed eventLoopGroup io.netty.channel.nio.NioEventLoopGroup@6fb32dfb in 1003 ms
> 2016-01-14 22:46:39,470 [pool-6-thread-2] INFO  o.a.drill.exec.rpc.data.DataServer - closed eventLoopGroup io.netty.channel.nio.NioEventLoopGroup@5c93dd80 in 1003 ms
> 2016-01-14 22:46:39,470 [pool-6-thread-1] INFO  o.a.drill.exec.service.ServiceEngine - closed userServer in 1004 ms
> 2016-01-14 22:46:39,470 [pool-6-thread-2] INFO  o.a.drill.exec.service.ServiceEngine - closed dataPool in 1005 ms
> 2016-01-14 22:46:39,483 [Drillbit-ShutdownHook#0] WARN  o.apache.drill.exec.work.WorkManager - Closing WorkManager but there are 2 running fragments.
> 2016-01-14 22:46:41,489 [Drillbit-ShutdownHook#0] ERROR o.a.d.exec.server.BootStrapContext - Pool did not terminate
> 2016-01-14 22:46:41,498 [Drillbit-ShutdownHook#0] WARN  o.apache.drill.exec.server.Drillbit - Failure on close()
> java.lang.RuntimeException: Exception while closing
>         at org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:46) ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.server.BootStrapContext.close(BootStrapContext.java:127) ~[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:76) ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:64) ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.close(Drillbit.java:149) ~[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit$ShutdownThread.run(Drillbit.java:243) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> Caused by: java.util.ConcurrentModificationException: null
>         at java.util.IdentityHashMap$IdentityHashMapIterator.nextIndex(IdentityHashMap.java:732) ~[na:1.7.0_71]
>         at java.util.IdentityHashMap$KeyIterator.next(IdentityHashMap.java:822) ~[na:1.7.0_71]
>         at org.apache.drill.exec.memory.BaseAllocator.print(BaseAllocator.java:683) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.memory.BaseAllocator.print(BaseAllocator.java:679) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.memory.BaseAllocator.print(BaseAllocator.java:679) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.memory.BaseAllocator.toString(BaseAllocator.java:483) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.memory.BaseAllocator.close(BaseAllocator.java:431) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:44) ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         ... 5 common frames omitted
> 2016-01-14 22:46:41,499 [Drillbit-ShutdownHook#0] INFO  o.apache.drill.exec.server.Drillbit - Shutdown completed (9062 ms).
> {code}
> or this stack trace:
> {code}
> 2016-01-15 00:59:09,699 [Drillbit-ShutdownHook#0] WARN  o.apache.drill.exec.work.WorkManager - Closing WorkManager but there are 1 running fragments.
> 2016-01-15 00:59:11,704 [Drillbit-ShutdownHook#0] ERROR o.a.d.exec.server.BootStrapContext - Pool did not terminate
> 2016-01-15 00:59:11,708 [Drillbit-ShutdownHook#0] WARN  o.apache.drill.exec.server.Drillbit - Failure on close()
> java.lang.RuntimeException: Exception while closing
>         at org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:46) ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.server.BootStrapContext.close(BootStrapContext.java:127) ~[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:76) ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.common.AutoCloseables.close(AutoCloseables.java:64) ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit.close(Drillbit.java:149) ~[drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.server.Drillbit$ShutdownThread.run(Drillbit.java:243) [drill-java-exec-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
> Caused by: java.lang.IllegalStateException: allocator[frag:19:0]: buffer space (0) + prealloc space (0) + child space (111759688) != allocated (111104328)
>         at org.apache.drill.exec.memory.BaseAllocator.verifyAllocator(BaseAllocator.java:653) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.memory.BaseAllocator.verifyAllocator(BaseAllocator.java:557) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.memory.BaseAllocator.verifyAllocator(BaseAllocator.java:528) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.exec.memory.BaseAllocator.close(BaseAllocator.java:419) ~[drill-memory-base-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         at org.apache.drill.common.DrillAutoCloseables.closeNoChecked(DrillAutoCloseables.java:44) ~[drill-common-1.5.0-SNAPSHOT.jar:1.5.0-SNAPSHOT]
>         ... 5 common frames omitted
> 2016-01-15 00:59:11,708 [Drillbit-ShutdownHook#0] INFO  o.apache.drill.exec.server.Drillbit - Shutdown completed (9058 ms).
> {code}
> Fragment 19:0 is sort that failed with OOM:
> {code}
> 2016-01-15 00:58:00,552 [2967bc04-43e6-6e2c-e91f-4401daae16cc:frag:19:0] INFO  o.a.d.e.p.i.xsort.ExternalSortBatch - User Error Occurred
> org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: One or more nodes ran out of memory while executing the query.
> {code}
> My setup (I'm sure it can be scaled down to one node)
> * 4 node cluster
> * each box is 32 cores
> * 48 GB direct memory
> * 10GB memory allocated to sort
> Drill settings:
> {code}
> 0: jdbc:drill:schema=dfs> select * from sys.options where status like '%CHANGED%';
> +-------------------------------------------+----------+---------+----------+--------------+-------------+-----------+------------+
> |                   name                    |   kind   |  type   |  status  |   num_val    | string_val  | bool_val  | float_val  |
> +-------------------------------------------+----------+---------+----------+--------------+-------------+-----------+------------+
> | planner.enable_decimal_data_type          | BOOLEAN  | SYSTEM  | CHANGED  | null         | null        | true      | null       |
> | planner.enable_hashjoin                   | BOOLEAN  | SYSTEM  | CHANGED  | null         | null        | false     | null       |
> | planner.memory.max_query_memory_per_node  | LONG     | SYSTEM  | CHANGED  | 10737418240  | null        | null      | null       |
> +-------------------------------------------+----------+---------+----------+--------------+-------------+-----------+------------+
> 3 rows selected (3.532 seconds)
> {code}
> Or maybe we don't wait for anything to complete and just shut everything down ?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)