Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/02/15 17:40:42 UTC

[jira] [Commented] (DRILL-5267) Managed external sort spills too often with Parquet data

    [ https://issues.apache.org/jira/browse/DRILL-5267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868230#comment-15868230 ] 

Paul Rogers commented on DRILL-5267:
------------------------------------

The query for this case is:

{code}
select * from dfs.`/some_file.parquet` order by c_email_address
{code}

The file was created (by another person) via a CTTAS of a join on TPC-DS data.

When run locally with the unmanaged sort, we get the following results and total time (debug mode):

{code}
Results: 1,434,519 records, 4233 batches, 69,294 ms
{code}

The old sort used small spill batches:

{code}
read 339 records
{code}

With the managed sort, we get:

{code}
Actual batch schema & sizes {
  T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
  T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
  T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
...
  c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
  Records: 1129, Total size: 32006144, Row width:28350, Density:5}

Memory Estimates: record size = 335 bytes; input batch = 32006144 bytes, 1129 records; 
merge batch size = 8388608 bytes, 25040 records; 
output batch size = 16777216 bytes, 50081 records; 
Available memory: 2147483648, spill point = 48783360, min. merge memory = 117440512

...
Starting spill from memory. Memory = 2079733760, Buffered batch count = 65, Spill batch count = 8
mergeAndSpill: completed, memory = 2090776608, spilled 9032 records to /tmp/.../spill3
{code}
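As a sanity check, the merge and output record counts in that estimate line are just the batch byte limits divided by the 335-byte record size estimate. A quick sketch (constants copied from the log above; the class and names are illustrative, not the actual sort code):

{code}
// Check that the logged merge/output record counts are simply
// (batch byte limit) / (estimated record size). Constants are from the log.
public class EstimateCheck {
  public static void main(String[] args) {
    int recordSize = 335;                   // estimated bytes per record
    int mergeBatchSize = 8_388_608;         // 8 MB merge batch limit
    int outputBatchSize = 16_777_216;       // 16 MB output batch limit

    System.out.println(mergeBatchSize / recordSize);   // 25040, as logged
    System.out.println(outputBatchSize / recordSize);  // 50081, as logged
  }
}
{code}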

Let's look at the spill files:

{code}
3,841,038 spill1
3,836,810 spill2
3,834,634 spill3
3,846,039 spill4
{code}

This shows the impact of the low-density batches. Spill files are supposed to be 256 MB each. But with 2 GB of memory at roughly 5% density, we can hold only about 102 MB of actual data. We still need to figure out why the spill files are only a few MB each.
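As a rough check on that 102 MB figure (assuming density here simply means data bytes over allocated vector bytes; the class below is illustrative, not Drill code):

{code}
// Back-of-the-envelope: how much real data 2 GB of vectors holds at ~5% density.
public class DensityCheck {
  public static void main(String[] args) {
    long sortMemory = 2L * 1024 * 1024 * 1024;        // 2 GB of sort memory
    double density = 0.05;                            // ~5% density reported above
    long usefulBytes = (long) (sortMemory * density);
    System.out.println(usefulBytes / (1024 * 1024));  // ~102 MB of actual data
  }
}
{code}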

Looking more carefully, the memory estimates say that a merge batch should hold 25040 records, but the spill code says it spilled only 9032 records (exactly the 8 spill batches of 1129 records each shown in the log above).

Ah, the issue is the darn low-density batches again: we are trying to use the batch size to compute how much to actually spill:

{code}
    // Walk the buffered batches, accumulating their actual data size, and
    // stop once the accumulated size would exceed the target spill file size.
    long estSize = 0;
    int spillCount = 0;
    for (InputBatch batch : bufferedBatches) {
      estSize += batch.getDataSize();
      if (estSize > spillFileSize) {
        break;
      }
      spillCount++;
    }
{code}

But, since batches are low-density, batch size is a very poor proxy for actual on-disk size.

With the above fix:

{code}
Input Batch Estimates: record size = 335 bytes; input batch = 32006144 bytes, 1129 records
...
Starting spill from memory. .. Buffered batch count = 65, Spill batch count = 65
...
Results: 1,434,519 records, 31 batches, 37,586 ms

31,187,622 spill1
{code}

Spill files are roughly 8x larger (31 MB vs. about 3.8 MB) and run time is about half that of the unmanaged sort. All batches are now being spilled, and the number of batches used to deliver the output dropped from 4233 to 31. So far so good.

Since each batch is 32 MB, memory holds 32 MB * 65 = 2,080 MB ≈ 2 GB of vectors. But at the reported 5% density, those vectors, combined, hold only about 2 GB / 20 = 100 MB of data. Yet the spill file is only 31 MB, so something is still wrong.

To compute this a different way:

{code}
65 batches * 1129 records/batch * 335 bytes/record = 24 MB
{code}

This agrees (more or less) with the spill file size, which means the actual density is:

{code}
24 MB / 2 GB = 1.2%
{code}

Something is seriously bizarre... This tells us that of the 2 GB used to hold batches, only about 24 MB of useful data is being stored. This is really bad news!
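To make the density numbers concrete: the per-vector density in the batch report looks like data size over allocated vector size, and the batch-level figure works out to the 1.2% computed above. A rough check (values copied from the log; the percentage formula is my assumption, not the actual metric code):

{code}
// Density check using figures from the batch report above.
public class VectorDensity {
  public static void main(String[] args) {
    // c_email_address: 30327 data bytes in a 49152-byte vector.
    System.out.printf("%.0f%%%n", 100.0 * 30327 / 49152);           // ~62%, as logged

    // Whole batch: 1129 records at ~335 bytes each in a 32 MB batch.
    System.out.printf("%.1f%%%n", 100.0 * 1129 * 335 / 32_006_144); // ~1.2%
  }
}
{code}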

The sort is now doing the best it can with the batches it has been given. The next challenge is to understand why the batches hold so little data.

> Managed external sort spills too often with Parquet data
> --------------------------------------------------------
>
>                 Key: DRILL-5267
>                 URL: https://issues.apache.org/jira/browse/DRILL-5267
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.10
>
>
> DRILL-5266 describes how Parquet produces low-density record batches. As a result, the external sort spills more frequently than it should, because it sizes spill files based on batch memory size rather than on the actual data content of the batch. Since Parquet batches are about 95% empty space, the spill files end up far too small.
> Adjust the spill calculations based on actual data content, not the size of the overall record batch.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)