Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/02/15 17:35:41 UTC

[jira] [Created] (DRILL-5266) Parquet Reader produces "low density" record batches

Paul Rogers created DRILL-5266:
----------------------------------

             Summary: Parquet Reader produces "low density" record batches
                 Key: DRILL-5266
                 URL: https://issues.apache.org/jira/browse/DRILL-5266
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 1.10
            Reporter: Paul Rogers


Testing with the managed sort revealed that, for at least one file, the Parquet reader produces "low-density" batches: batches in which only 5% of each value vector holds actual data, with the rest being unused space. When such batches are fed into the sort, 95% of the buffered memory is wasted; only 5% of the available memory holds actual query data. The result is poor sort performance, as the sort must spill far more frequently than expected.
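To see the scale of the effect, here is a back-of-the-envelope sketch (hypothetical numbers: the 2 GiB sort budget is illustrative, not from this report; only the 5% density figure comes from the observed batches):

{code}
# Rough impact of 5%-density batches on the sort's memory budget.
sort_memory = 2 * 1024**3   # hypothetical 2 GiB available to the sort operator
density = 0.05              # fraction of each buffered batch that is real data

effective = sort_memory * density
print(effective / 1024**2)  # MiB of actual query data held before spilling
print(1 / density)          # spill factor vs. fully dense batches
{code}

At 5% density the sort can hold only about 102 MiB of real data in a 2 GiB budget, so it spills roughly 20x more often than it would with fully dense batches.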

The managed sort analyzes incoming batches to prepare good memory use estimates. The following is the output for the Parquet file in question:

{code}
Actual batch schema & sizes {
  T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
  T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
  T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
...
  c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
  Records: 1129, Total size: 32006144, Row width:28350, Density:5}
{code}
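The reported density figures appear consistent with the ratio of data size to allocated vector size, rounded up to a whole percent. A small sketch (an illustration of the apparent relationship in the numbers above, not Drill's actual code; the rounding mode is an assumption):

{code}
import math

def density_pct(data_size: int, vector_size: int) -> int:
    """Percent of the allocated vector bytes actually holding data,
    rounded up (assumed rounding mode; matches the figures above)."""
    return math.ceil(100 * data_size / vector_size)

# Values taken from the batch analysis above:
print(density_pct(4516, 131072))   # cs_sold_date_sk  -> 4
print(density_pct(30327, 49152))   # c_email_address  -> 62
{code}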



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)