Posted to issues@drill.apache.org by "Rahul Challapalli (JIRA)" <ji...@apache.org> on 2017/05/04 17:01:04 UTC
[jira] [Commented] (DRILL-5472) Parquet reader generating low-density batches causing Sort operator to spill un-necessarily
[ https://issues.apache.org/jira/browse/DRILL-5472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997075#comment-15997075 ]
Rahul Challapalli commented on DRILL-5472:
------------------------------------------
The Parquet file was generated using either constants or missing fields:
{code}
create table drill5472 as
select
d.map map,
d.map.missing1 missing1,
'hello' as missing2,
true as missing3,
5.888 as missing4,
cast('abcd' as varchar) missing5,
cast('1998-01-01' as date) missing6,
cast(1.1 as decimal(28,2)) missing7,
CAST(456 as CHAR(3)) missing8,
cast('P1Y' as interval year) missing9,
cast('P1D' as interval day) missing10,
cast('P1Y1M1DT1H1M' as interval second) missing11,
CONVERT_FROM('{x:100, y:215.6}' ,'JSON') missing12,
STRING_BINARY(CONVERT_TO(1, 'INT')) missing13,
STRING_BINARY(CONVERT_TO(1, 'INT_BE')) as missing14,
STRING_BINARY(CONVERT_TO(1, 'BIGINT')) as missing15,
STRING_BINARY(CONVERT_TO(1, 'BIGINT')) as missing16,
STRING_BINARY(CONVERT_TO(1, 'INT_HADOOPV')) as missing17,
STRING_BINARY(CONVERT_TO('hello', 'UTF8')) as missing18,
STRING_BINARY(CONVERT_TO('hello', 'UTF16')) missing19,
CONVERT_FROM(BINARY_STRING('\x00\x00\x00\xC8'), 'INT_BE') AS missing20,
CONVERT_FROM(BINARY_STRING('\x00\x00\x00\xC8'), 'INT') AS missing21,
CONVERT_FROM(BINARY_STRING('\xBE\xBA\xFE\xCA'), 'INT_BE') AS missing22,
CONVERT_TO(-1095041334, 'INT_BE') as missing23,
TO_CHAR(1256.789383, '#,###.###') missing24,
TO_CHAR((CAST('2008-2-23' AS DATE)), 'yyyy-MMM-dd') missing25,
CAST('12:20:30' AS TIME) missing26,
CAST('2015-2-23 12:00:00' AS TIMESTAMP) missing27,
TO_DATE('2015-FEB-23', 'yyyy-MMM-dd') missing28,
EXTRACT(year from mydate) `missing 29`,
TO_DATE(1427849046000) missing30,
TO_NUMBER('987,966', '######') missing31,
TO_TIME('12:20:30', 'HH:mm:ss') missing32,
TO_TIMESTAMP('2008-2-23 12:00:00', 'yyyy-MM-dd HH:mm:ss') missing33,
TIMEOFDAY() missing34,
d.map.missingmap.m1 m1
from dfs.`/drill/testdata/resource-manager/nested-large.json` d;
{code}
> Parquet reader generating low-density batches causing Sort operator to spill un-necessarily
> -------------------------------------------------------------------------------------------
>
> Key: DRILL-5472
> URL: https://issues.apache.org/jira/browse/DRILL-5472
> Project: Apache Drill
> Issue Type: Bug
> Components: Execution - Relational Operators, Storage - Parquet
> Reporter: Rahul Challapalli
> Assignee: Paul Rogers
> Attachments: drill5472.log, drill5472.parquet, drill5472.sys.drill
>
>
> git.commit.id.abbrev=1e0a14c
> The Parquet file used in the query below is ~20 MB. Its uncompressed size is ~1.2 GB. The query has a sort that is given ~6 GB of memory for a single fragment, and yet it spills.
> {code}
> select * from (select * from dfs.`/drill/testdata/resource-manager/all_types_large` s order by s.missing12.x) d where d.missing3 is false;
> {code}
> The profile indicates that the above query spilled twice. The profile and the logs are attached.
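
To put the sizes quoted in the description side by side, here is a small back-of-the-envelope sketch in Python. It uses only the approximate figures from the report (~20 MB file, ~1.2 GB uncompressed, ~6 GB sort memory). The function names, the choice of binary units, and the assumption that the sort spills only once its buffered batches exceed its memory budget are illustrative and not taken from Drill itself.

```python
# Approximate figures quoted in the issue description (all marked "~" there).
PARQUET_FILE_MB = 20.0    # on-disk size of the Parquet file
UNCOMPRESSED_GB = 1.2     # uncompressed size of the data
SORT_MEMORY_GB = 6.0      # memory given to the sort in a single fragment

# Assumption: binary units (1 GB = 1024 MB); the report does not say which.
MB_PER_GB = 1024.0


def expansion_ratio(file_mb, uncompressed_gb):
    """How many times larger the uncompressed data is than the file."""
    return uncompressed_gb * MB_PER_GB / file_mb


def memory_headroom(sort_memory_gb, uncompressed_gb):
    """How many times the uncompressed data fits in the sort's budget."""
    return sort_memory_gb / uncompressed_gb


def implied_density_bound(sort_memory_gb, uncompressed_gb):
    """Rough upper bound on average batch density, ASSUMING the sort spills
    only because its buffered batches exceeded the memory budget."""
    return uncompressed_gb / sort_memory_gb


ratio = expansion_ratio(PARQUET_FILE_MB, UNCOMPRESSED_GB)
headroom = memory_headroom(SORT_MEMORY_GB, UNCOMPRESSED_GB)
density = implied_density_bound(SORT_MEMORY_GB, UNCOMPRESSED_GB)

print(f"expansion ratio: ~{ratio:.0f}x")
print(f"memory headroom: ~{headroom:.0f}x")
print(f"implied average batch density: below ~{density:.0%}")
```

Under those assumptions the sort has roughly five times the memory the uncompressed data needs, so a spill is consistent with batches that are mostly empty space, which is what the issue title suggests. The density figure is only an inference from the quoted sizes, not a measurement from the attached profile.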
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)