Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/02/15 17:35:41 UTC
[jira] [Created] (DRILL-5266) Parquet Reader produces "low density" record batches
Paul Rogers created DRILL-5266:
----------------------------------
Summary: Parquet Reader produces "low density" record batches
Key: DRILL-5266
URL: https://issues.apache.org/jira/browse/DRILL-5266
Project: Apache Drill
Issue Type: Bug
Components: Storage - Parquet
Affects Versions: 1.10
Reporter: Paul Rogers
Testing with the managed sort revealed that, for at least one file, the Parquet reader produces "low-density" batches: batches in which only 5% of each value vector contains actual data, with the rest being unused space. When these batches are fed into the sort, 95% of the buffered memory is wasted space; only 5% of available memory holds actual query data. The result is poor sort performance, since the sort must spill far more frequently than expected.
The managed sort analyzes incoming batches to prepare good memory use estimates. The following is the analyzer output for the Parquet file in question:
{code}
Actual batch schema & sizes {
T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
...
c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
Records: 1129, Total size: 32006144, Row width:28350, Density:5}
{code}
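For reference, the reported density is simply the ratio of actual data bytes to allocated vector bytes, expressed as a percentage. A minimal sketch (not Drill code; the ceiling rounding is an assumption chosen to match the figures above):

{code}
import math

def batch_density(data_size: int, vector_size: int) -> int:
    """Percent of allocated vector memory that holds real data.

    Hypothetical helper mirroring the analyzer output above; the
    ceiling rounding is an assumption, not confirmed Drill behavior.
    """
    return math.ceil(100 * data_size / vector_size)

# Figures from the columns above:
print(batch_density(4516, 131072))   # cs_sold_date_sk -> 4
print(batch_density(30327, 49152))   # c_email_address -> 62
{code}

With a 131072-byte vector and 4-byte values, the row capacity is 32768, yet only 1129 records are present, which is what drives the 4-5% density figures.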
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)