Posted to issues@drill.apache.org by "Paul Rogers (JIRA)" <ji...@apache.org> on 2017/02/15 21:48:42 UTC

[jira] [Comment Edited] (DRILL-5266) Parquet Reader produces "low density" record batches

    [ https://issues.apache.org/jira/browse/DRILL-5266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868600#comment-15868600 ] 

Paul Rogers edited comment on DRILL-5266 at 2/15/17 9:48 PM:
-------------------------------------------------------------

An explanation for the whacky column sizing is that the metadata returns sizes in *bits*, while the rest of the code expects sizes in *bytes*. The observed row width of 2,680 is 335 * 8; the same factor of 8 might explain the fixed field width of 1856 seen earlier: 1856 / 8 = 232, which seems a reasonable byte width for the fixed columns.

{code}
  private int getDataTypeLength(ColumnDescriptor column, SchemaElement se) {
    ...
        return getTypeLengthInBits(column.getType());
      ...
  }
  private long determineSizesSerial(long recordsToReadInThisPass) throws IOException {
    ...
      // check that the next record will fit in the batch
      if (exitLengthDeterminingLoop ||
          (recordsReadInCurrentPass + 1) * parentReader.getBitWidthAllFixedFields()
              + totalVariableLengthData + lengthVarFieldsInCurrentRecord > parentReader.getBatchSize()) {
{code}

Note that the code above uses the *bit* length of the fixed fields to decide whether the next record's *bytes* will fit into the current batch. The bit width should be divided by 8 (converted to bytes) before the comparison.
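
A rough sketch of the fix (illustration only; fixedBytesSoFar is a made-up local, all other names come from the excerpt above): convert the accumulated fixed-field width from bits to bytes before comparing it against the batch size limit.

{code}
      // Sketch only: getBitWidthAllFixedFields() is in bits, getBatchSize() is in
      // bytes, so divide by 8 before the comparison. fixedBytesSoFar is illustrative.
      long fixedBytesSoFar =
          (recordsReadInCurrentPass + 1) * parentReader.getBitWidthAllFixedFields() / 8;
      if (exitLengthDeterminingLoop ||
          fixedBytesSoFar + totalVariableLengthData
              + lengthVarFieldsInCurrentRecord > parentReader.getBatchSize()) {
        // stop: the next record would not fit within the batch's byte budget
      }
{code}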




> Parquet Reader produces "low density" record batches
> ----------------------------------------------------
>
>                 Key: DRILL-5266
>                 URL: https://issues.apache.org/jira/browse/DRILL-5266
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - Parquet
>    Affects Versions: 1.10
>            Reporter: Paul Rogers
>
> Testing with the managed sort revealed that, for at least one file, Parquet produces "low-density" batches: batches in which only 5% of each value vector contains actual data, with the rest being unused space. When these batches are fed into the sort, we end up buffering batches that are 95% wasted space, using only 5% of available memory to hold actual query data. The result is poor sort performance, as the sort must spill far more frequently than expected.
> The managed sort analyzes incoming batches to prepare good memory use estimates. The following is the output from the Parquet file in question:
> {code}
> Actual batch schema & sizes {
>   T1¦¦cs_sold_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_sold_time_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
>   T1¦¦cs_ship_date_sk(std col. size: 4, actual col. size: 4, total size: 196608, vector size: 131072, data size: 4516, row capacity: 32768, density: 4)
> ...
>   c_email_address(std col. size: 54, actual col. size: 27, total size: 53248, vector size: 49152, data size: 30327, row capacity: 4095, density: 62)
>   Records: 1129, Total size: 32006144, Row width:28350, Density:5}
> {code}
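
A rough back-of-the-envelope reading of the figures quoted above (illustrative only; the numbers are copied from the batch-size report, and the ratios shown are an assumption about how density is derived, not Drill's exact formula):

{code}
public class BatchDensityCheck {
  public static void main(String[] args) {
    long totalBatchBytes = 32_006_144L;   // "Total size" from the log
    long dataBytes       = 4_516L;        // "data size" for cs_sold_date_sk
    long vectorBytes     = 131_072L;      // "vector size" for the same column

    // Per-column density: actual data as a share of the allocated vector (~3-4%).
    double columnDensity = 100.0 * dataBytes / vectorBytes;

    // At the reported overall density of 5%, only ~1.6 MB of a ~32 MB batch
    // holds real data; the remaining ~30 MB is buffered but unused space.
    double usefulBytes = totalBatchBytes * 0.05;

    System.out.printf("column density ~%.1f%%, useful bytes per batch ~%.0f%n",
        columnDensity, usefulBytes);
  }
}
{code}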



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)