Posted to issues@drill.apache.org by "Rahul Challapalli (JIRA)" <ji...@apache.org> on 2017/05/04 17:01:04 UTC
[jira] [Commented] (DRILL-5472) Parquet reader generating low-density batches causing Sort operator to spill un-necessarily

    [ https://issues.apache.org/jira/browse/DRILL-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997075#comment-15997075 ] 

Rahul Challapalli commented on DRILL-5472:
------------------------------------------

The Parquet file was generated using only constants or missing fields:
{code}
create table drill5472 as 
select 
      d.map map,
      d.map.missing1 missing1, 
      'hello' as missing2, 
      true as missing3, 
      5.888 as missing4, 
      cast('abcd' as varchar) missing5, 
      cast('1998-01-01' as date) missing6, 
      cast(1.1 as decimal(28,2)) missing7, 
      CAST(456 as CHAR(3)) missing8, 
      cast('P1Y' as interval year) missing9, 
      cast('P1D' as interval day) missing10,
      cast('P1Y1M1DT1H1M' as interval second) missing11,
      CONVERT_FROM('{x:100, y:215.6}' ,'JSON') missing12,
      STRING_BINARY(CONVERT_TO(1, 'INT')) missing13,
      STRING_BINARY(CONVERT_TO(1, 'INT_BE')) as missing14,
      STRING_BINARY(CONVERT_TO(1, 'BIGINT')) as missing15,
      STRING_BINARY(CONVERT_TO(1, 'BIGINT')) as missing16,
      STRING_BINARY(CONVERT_TO(1, 'INT_HADOOPV')) as missing17,
      STRING_BINARY(CONVERT_TO('hello', 'UTF8')) as missing18,
      STRING_BINARY(CONVERT_TO('hello', 'UTF16')) missing19,
      CONVERT_FROM(BINARY_STRING('\x00\x00\x00\xC8'), 'INT_BE') AS missing20,
      CONVERT_FROM(BINARY_STRING('\x00\x00\x00\xC8'), 'INT') AS missing21,
      CONVERT_FROM(BINARY_STRING('\xBE\xBA\xFE\xCA'), 'INT_BE') AS missing22,
      CONVERT_TO(-1095041334, 'INT_BE') as missing23,
      TO_CHAR(1256.789383, '#,###.###') missing24,
      TO_CHAR((CAST('2008-2-23' AS DATE)), 'yyyy-MMM-dd') missing25,
      CAST('12:20:30' AS TIME) missing26,
      CAST('2015-2-23 12:00:00' AS TIMESTAMP) missing27,
      TO_DATE('2015-FEB-23', 'yyyy-MMM-dd') missing28,
      EXTRACT(year from mydate) `missing 29`,
      TO_DATE(1427849046000) missing30,
      TO_NUMBER('987,966', '######') missing31,
      TO_TIME('12:20:30', 'HH:mm:ss') missing32,
      TO_TIMESTAMP('2008-2-23 12:00:00', 'yyyy-MM-dd HH:mm:ss') missing33,
      TIMEOFDAY() missing34,
      d.map.missingmap.m1 m1 
    from dfs.`/drill/testdata/resource-manager/nested-large.json` d;
{code}

> Parquet reader generating low-density batches causing Sort operator to spill un-necessarily
> -------------------------------------------------------------------------------------------
>
>                 Key: DRILL-5472
>                 URL: https://issues.apache.org/jira/browse/DRILL-5472
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators, Storage - Parquet
>            Reporter: Rahul Challapalli
>            Assignee: Paul Rogers
>         Attachments: drill5472.log, drill5472.parquet, drill5472.sys.drill
>
>
> git.commit.id.abbrev=1e0a14c
> The Parquet file used in the query below is ~20 MB; its uncompressed size is ~1.2 GB. The query has a sort that is given ~6 GB of memory for a single fragment, and yet it spills.
> {code}
> select * from (select * from dfs.`/drill/testdata/resource-manager/all_types_large` s order by s.missing12.x) d where d.missing3 is false;
> {code}
> The profile indicates that the above query spilled twice. The profile and the logs are attached.
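
The numbers in the description above give a rough feel for how sparse the batches would have to be for the sort to spill. The sketch below is plain arithmetic in Python, not Drill code. The helper names `min_density_to_avoid_spill` and `buffered_bytes` are hypothetical, "density" is taken to mean payload bytes divided by allocated buffer bytes, and the sketch assumes the sort can devote its whole budget to buffered batches, which a real sort operator cannot.

```python
# Back-of-the-envelope sketch; not Drill code. All names are hypothetical.
GB = 1024 ** 3


def min_density_to_avoid_spill(data_bytes: int, sort_budget_bytes: int) -> float:
    """Smallest average batch density at which the buffers holding
    `data_bytes` of payload still fit in `sort_budget_bytes`.

    Density = payload bytes / allocated buffer bytes for a batch.
    Assumption: the whole budget is available for buffered batches,
    which overstates the usable memory of a real sort operator.
    """
    return data_bytes / sort_budget_bytes


def buffered_bytes(data_bytes: int, density: float) -> float:
    """Memory occupied by batches carrying `data_bytes` at a given density."""
    if not 0 < density <= 1:
        raise ValueError("density must be in (0, 1]")
    return data_bytes / density


data = int(1.2 * GB)   # ~1.2 GB uncompressed, from the issue description
budget = 6 * GB        # ~6 GB sort memory, from the issue description

threshold = min_density_to_avoid_spill(data, budget)
print(f"spill expected below ~{threshold:.0%} density")

for density in (1.0, 0.5, 0.1):   # illustrative values, not measurements
    needed = buffered_bytes(data, density)
    print(f"density {density:.0%}: ~{needed / GB:.1f} GB buffered, "
          f"spills: {needed > budget}")
```

Under these assumptions, ~1.2 GB of data would overflow a ~6 GB budget only if batches were, on average, less than about one-fifth full. The actual batch density is not stated in the issue.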


