You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Csaba Ringhofer (JIRA)" <ji...@apache.org> on 2019/01/15 15:36:00 UTC

[jira] [Commented] (IMPALA-4994) Push conversion and validation into dictionary construction

    [ https://issues.apache.org/jira/browse/IMPALA-4994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16743149#comment-16743149 ] 

Csaba Ringhofer commented on IMPALA-4994:
-----------------------------------------

I have a possible performance concern about CHAR(N): it could make the dictionary much larger if N is large + need a lot of copying. The current solution of treating CHAR(N) the same way as STRING during dictionary construction has the advantage that no copying is necessary, the dictionary will simply contain a pointer to the buffer with the strings. Meanwhile the conversion itself is quite cheap (especially compared to TIMESTAMP...). so we wouldn't win too much by doing it for less elements.

On the other side, doing the conversion during dictionary construction would enable dictionary filtering, which could be a major speed up for some queries.

> Push conversion and validation into dictionary construction
> -----------------------------------------------------------
>
>                 Key: IMPALA-4994
>                 URL: https://issues.apache.org/jira/browse/IMPALA-4994
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.9.0
>            Reporter: Joe McDonnell
>            Assignee: Csaba Ringhofer
>            Priority: Major
>              Labels: ramp-up
>
> Certain data types require conversion and/or validation when read from a Parquet file. For example, timestamps can require conversion to account for different storage offsets. Char/varchar fields can require conversion to handle lengths and space padding. Timestamps require validation, because not all bit combinations are valid timestamps.
> Right now, this is done per element as it is read. For dictionary encoded columns, it would save processing to do the conversion/validation once at dictionary construction.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org