Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/05/03 21:11:04 UTC

[jira] [Comment Edited] (DRILL-5470) External Sort - Unable to Allocate Buffer error

    [ https://issues.apache.org/jira/browse/DRILL-5470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15995687#comment-15995687 ] 

Paul Rogers edited comment on DRILL-5470 at 5/3/17 9:10 PM:
------------------------------------------------------------

The key lines are:

2017-05-03 12:02:56,133 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:6] DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - Config: memory limit = 10737418240, spill file size = 268435456, spill batch size = 8388608, merge limit = 2147483647, merge batch size = 16777216

The above is for the thread that eventually dies. It says that the sort has 10,737,418,240 bytes (10 GB) available for buffering.

…
2017-05-03 12:02:58,958 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:6] DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - Actual batch schema & sizes {
  $f0(type: VARCHAR, std col. size: 54, actual col. size: 35, total size: 303104, data size: 276748, row capacity: 8191, density: 92)
  Records: 8096, Total size: 319488, Row width:41, Density:92}

The above describes your actual data: a single VARCHAR column, average width of 35 bytes, an input batch size of only about 300 KB, and 8,096 rows per batch. This is quite reasonable, nothing at all bizarre.
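As a quick sanity check, the logged figures hang together; the sketch below is just arithmetic over the numbers reported above, not Drill code, and rounding accounts for the small differences from the logged values.

{code:java}
// Back-of-the-envelope check of the logged batch metrics (not Drill code).
public class BatchMetricsCheck {
  public static void main(String[] args) {
    long records   = 8_096L;    // "Records: 8096"
    long dataSize  = 276_748L;  // "data size: 276748" (actual VARCHAR bytes)
    long totalSize = 303_104L;  // "total size: 303104" (allocated vector space)

    System.out.println("avg column width ~ " + dataSize / records + " bytes"); // 34, logged as 35
    System.out.println("density ~ " + (dataSize * 100 / totalSize) + "%");     // 91, logged as 92
  }
}
{code}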

2017-05-03 12:02:58,983 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:6] DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - Input Batch Estimates: record size = 35 bytes; input batch = 319488 bytes, 8096 records
2017-05-03 12:02:58,983 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:6] DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - Merge batch size = 4587450 bytes, 65535 records; spill file size: 268435456 bytes
2017-05-03 12:02:58,983 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:6] DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - Output batch size = 4587450 bytes, 65535 records
2017-05-03 12:02:58,983 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:6] DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - Available memory: 10737418240, buffer memory = 10727923852, merge memory = 10723655890

The above explains the memory plan for this sort. For our purposes, the “buffer memory” figure says we can hold in-memory batches until we reach a limit of 10,727,923,852 bytes. That leaves roughly 10 MB in which to create the merge batch, which we estimate at about 4,587,450 bytes. (We give ourselves a 100% safety margin. The question is, why was that not enough?)
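For what it's worth, the reservations behind those numbers can be reconstructed from the logged values alone. The sketch below is just arithmetic over the log output, not the actual ExternalSortBatch code, so treat the interpretation in the comments as a best guess.

{code:java}
// Reconstructing the memory-plan reservations from the logged values above.
public class MemoryPlanCheck {
  public static void main(String[] args) {
    long availableMemory = 10_737_418_240L;  // "Available memory"
    long bufferMemory    = 10_727_923_852L;  // "buffer memory"
    long mergeMemory     = 10_723_655_890L;  // "merge memory"
    long mergeBatchSize  =      4_587_450L;  // estimated merge/output batch size
    long inputBatchSize  =        319_488L;  // observed input batch size

    // Reservation during the load phase: 9,494,388 bytes, which matches a
    // doubled merge batch (the 100% safety margin) plus one input batch.
    System.out.println(availableMemory - bufferMemory);       // 9494388
    System.out.println(2 * mergeBatchSize + inputBatchSize);  // 9494388

    // Reservation during the merge phase: 13,762,350 bytes = 3 merge batches.
    System.out.println(availableMemory - mergeMemory);        // 13762350
    System.out.println(3 * mergeBatchSize);                   // 13762350
  }
}
{code}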

…

Then, we see this:

2017-05-03 12:38:02,927 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:6] INFO  o.a.d.e.p.i.x.m.ExternalSortBatch - User Error Occurred: External Sort encountered an error while spilling to disk (Unable to allocate buffer of size 268435456 due to memory limit. Current allocation: 10579849472)
org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: External Sort encountered an error while spilling to disk

At the debug level of logging, we can’t see how we got here. Trace logging provides all the gory details.

The message says that the current allocation is 10,579,849,472 bytes. We saw earlier that the limit is 10,737,418,240, so we should have had about 157 MB free.

The error says, “Unable to allocate buffer of size 268,435,456”. That request (256 MB) is larger than the roughly 157 MB of free memory, so the allocation failure itself is legitimate.

But here is where things get strange: 256 MB is far more than the 4.5 MB we estimated for the output batch size. Given that values average 35 bytes each, this says we are trying to build a single VARCHAR vector holding roughly 7.6 million values. But our batches (value vectors) allow only 64K rows.
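To make the mismatch concrete, here is the arithmetic over the values in the error message (again, plain arithmetic, not Drill code):

{code:java}
// How much memory was left, and what a 256 MB VARCHAR buffer implies at
// ~35 bytes per value.
public class FailureArithmetic {
  public static void main(String[] args) {
    long memoryLimit       = 10_737_418_240L;  // sort memory limit (config line)
    long currentAllocation = 10_579_849_472L;  // from the error message
    long requestedBuffer   =    268_435_456L;  // from the error message (256 MB)
    long avgValueWidth     =             35L;  // from the batch-size analysis

    long freeMemory = memoryLimit - currentAllocation;                        // 157,568,768 (~157 MB)
    System.out.println("free bytes = " + freeMemory);
    System.out.println("request > free? " + (requestedBuffer > freeMemory));  // true

    // A 256 MB data buffer at ~35 bytes per value implies ~7.6 million values,
    // far beyond the 65,536-row limit of a single value vector.
    System.out.println("implied values = " + requestedBuffer / avgValueWidth); // 7669584
  }
}
{code}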

Finally, these lines confirm that the sort is still loading and sorting input batches (it has not yet reached the final merge):

	at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.doMergeAndSpill(ExternalSortBatch.java:1447) [drill-java-exec-1.10.0.jar:1.10.0]
	at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.mergeAndSpill(ExternalSortBatch.java:1376) [drill-java-exec-1.10.0.jar:1.10.0]
	at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.spillFromMemory(ExternalSortBatch.java:1339) [drill-java-exec-1.10.0.jar:1.10.0]
	at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.processBatch(ExternalSortBatch.java:831) [drill-java-exec-1.10.0.jar:1.10.0]

The upshot is that something is seriously off. It looks like you have done everything right, and most of the sort is reporting sensible numbers, but the spill operation has gone seriously wrong.

It is not clear why you would have gotten this error on the “old” sort as well as the “managed” version.

My next step is to look through the code to see if I can account for the behavior shown in the logs.



> External Sort - Unable to Allocate Buffer error
> -----------------------------------------------
>
>                 Key: DRILL-5470
>                 URL: https://issues.apache.org/jira/browse/DRILL-5470
>             Project: Apache Drill
>          Issue Type: Bug
>          Components:  Server
>    Affects Versions: 1.10.0
>         Environment: - ubuntu 14.04
> - r3.8xl (32 CPU/240GB Mem)
> - openjdk version "1.8.0_111"
> - drill 1.10.0 with 8656c83b00f8ab09fb6817e4e9943b2211772541 cherry-picked
>            Reporter: Nathan Butler
>            Assignee: Paul Rogers
>
> Per the mailing list discussion and Rahul's and Paul's suggestion I'm filing this Jira issue. Drill seems to be running out of memory when doing an External Sort. Per Zelaine's suggestion I enabled sort.external.disable_managed in drill-override.conf and in the sqlline session. This caused the query to run longer, but it still failed with the same message.
> Per Paul's suggestion, I enabled debug logging for the org.apache.drill.exec.physical.impl.xsort.managed package and re-ran the query.
> Here's the initial DEBUG line for ExternalSortBatch for our query:
> bq. 2017-05-03 12:02:56,095 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:15] DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - Config: memory limit = 10737418240, spill file size = 268435456, spill batch size = 8388608, merge limit = 2147483647, merge batch size = 16777216
> And here's the last DEBUG line before the stack trace:
> bq. 2017-05-03 12:37:44,249 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:4] DEBUG o.a.d.e.p.i.x.m.ExternalSortBatch - Available memory: 10737418240, buffer memory = 10719535268, merge memory = 10707140978
> And the stacktrace:
> {quote}
> 2017-05-03 12:38:02,927 [26f600f1-17b3-d649-51be-2ca0c9bf7606:frag:2:6] INFO  o.a.d.e.p.i.x.m.ExternalSortBatch - User Error Occurred: External Sort encountered an error while spilling to disk (Unable to allocate buffer of size 268435456 due to memory limit. Current allocation: 10579849472)
> org.apache.drill.common.exceptions.UserException: RESOURCE ERROR: External Sort encountered an error while spilling to disk
> [Error Id: 5d53c677-0cd9-4c01-a664-c02089670a1c ]
>         at org.apache.drill.common.exceptions.UserException$Builder.build(UserException.java:544) ~[drill-common-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.doMergeAndSpill(ExternalSortBatch.java:1447) [drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.mergeAndSpill(ExternalSortBatch.java:1376) [drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.spillFromMemory(ExternalSortBatch.java:1339) [drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.processBatch(ExternalSortBatch.java:831) [drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.loadBatch(ExternalSortBatch.java:618) [drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.load(ExternalSortBatch.java:660) [drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.innerNext(ExternalSortBatch.java:559) [drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) [drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:119) [drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:109) [drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch.innerNext(StreamingAggBatch.java:137) [drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.record.AbstractRecordBatch.next(AbstractRecordBatch.java:162) [drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:104) [drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.physical.impl.partitionsender.PartitionSenderRootExec.innerNext(PartitionSenderRootExec.java:144) [drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.physical.impl.BaseRootExec.next(BaseRootExec.java:94) [drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:232) [drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.work.fragment.FragmentExecutor$1.run(FragmentExecutor.java:226) [drill-java-exec-1.10.0.jar:1.10.0]
>         at java.security.AccessController.doPrivileged(Native Method) [na:1.8.0_111]
>         at javax.security.auth.Subject.doAs(Subject.java:422) [na:1.8.0_111]
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) [hadoop-common-2.7.1.jar:na]
>         at org.apache.drill.exec.work.fragment.FragmentExecutor.run(FragmentExecutor.java:226) [drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.common.SelfCleaningRunnable.run(SelfCleaningRunnable.java:38) [drill-common-1.10.0.jar:1.10.0]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0_111]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_111]
>         at java.lang.Thread.run(Thread.java:745) [na:1.8.0_111]
> Caused by: org.apache.drill.exec.exception.OutOfMemoryException: Unable to allocate buffer of size 268435456 due to memory limit. Current allocation: 10579849472
>         at org.apache.drill.exec.memory.BaseAllocator.buffer(BaseAllocator.java:220) ~[drill-memory-base-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.memory.BaseAllocator.buffer(BaseAllocator.java:195) ~[drill-memory-base-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.vector.VarCharVector.reAlloc(VarCharVector.java:425) ~[vector-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.vector.VarCharVector.copyFromSafe(VarCharVector.java:278) ~[vector-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.vector.NullableVarCharVector.copyFromSafe(NullableVarCharVector.java:379) ~[vector-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.test.generated.PriorityQueueCopierGen140.doCopy(PriorityQueueCopierTemplate.java:22) ~[na:na]
>         at org.apache.drill.exec.test.generated.PriorityQueueCopierGen140.next(PriorityQueueCopierTemplate.java:76) ~[na:na]
>         at org.apache.drill.exec.physical.impl.xsort.managed.CopierHolder$BatchMerger.next(CopierHolder.java:234) ~[drill-java-exec-1.10.0.jar:1.10.0]
>         at org.apache.drill.exec.physical.impl.xsort.managed.ExternalSortBatch.doMergeAndSpill(ExternalSortBatch.java:1408) [drill-java-exec-1.10.0.jar:1.10.0]
>         ... 24 common frames omitted
> {quote}
> I'm in communication with Paul and will send him the full log file.
> Thanks,
> Nathan


