Posted to issues@impala.apache.org by "Tim Armstrong (JIRA)" <ji...@apache.org> on 2017/11/22 19:13:00 UTC

[jira] [Resolved] (IMPALA-5307) Consider always copying-out Disk I/O buffers instead of attaching to RowBatches

     [ https://issues.apache.org/jira/browse/IMPALA-5307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong resolved IMPALA-5307.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.11.0

> Consider always copying-out Disk I/O buffers instead of attaching to RowBatches
> -------------------------------------------------------------------------------
>
>                 Key: IMPALA-5307
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5307
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>              Labels: resource-management
>             Fix For: Impala 2.11.0
>
>
> IMPALA-4835 would be greatly simplified if we didn't have to attach disk I/O buffers to RowBatches and handle the resulting complexity.
> Disk I/O buffers currently need to be attached to RowBatches if the row batches directly reference var-len data in the buffer. This can occur only when all of the following hold:
> * The column being read contains strings
> * The string data is not dictionary-encoded in Parquet (since we copy out the dictionary data in Parquet)
> * The string data is not compressed with a general-purpose compression algorithm (gzip, Snappy, etc.).
> This includes the following cases: plain-encoded strings in uncompressed Parquet; any strings in uncompressed text, RCFile, Avro, or sequence file.
> In those cases the copy avoidance could provide some performance benefits. However, it's unclear whether any of those file formats are, or should be, used in performance-critical use cases, because the storage density of uncompressed strings is almost always terrible.
> We should evaluate the performance impact of the additional copies, but I suspect that it is not severe and does not impact any important use cases.
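
The three conditions in the description, and the difference between referencing and copying out, can be sketched in C++ roughly as follows. This is only an illustration: ColumnScan, Format, CanReferenceIoBuffer and CopyOut are hypothetical names invented for this sketch and are not Impala's actual classes or functions.

```cpp
#include <string>
#include <string_view>

// Hypothetical stand-ins for illustration; not Impala types.
enum class Format { kParquet, kText, kRcFile, kAvro, kSequenceFile };

struct ColumnScan {
  Format format;
  bool is_string;           // column holds var-len string data
  bool dictionary_encoded;  // only meaningful for Parquet
  bool compressed;          // general-purpose codec such as gzip or Snappy
};

// Returns true when rows could point directly into the disk I/O buffer,
// i.e. when all three conditions listed in the issue hold.
bool CanReferenceIoBuffer(const ColumnScan& c) {
  if (!c.is_string) return false;
  // Parquet dictionary data is copied out, so rows never reference the buffer.
  if (c.format == Format::kParquet && c.dictionary_encoded) return false;
  // Decompression writes into a separate buffer, not the I/O buffer.
  if (c.compressed) return false;
  return true;
}

// Copy-out: the returned string owns its bytes, so the I/O buffer can be
// recycled as soon as the copy is made.
std::string CopyOut(std::string_view value_in_io_buffer) {
  return std::string(value_in_io_buffer);
}
```

With copy-out, the lifetime of the value no longer depends on the I/O buffer, at the cost of one extra copy per string value in the cases where CanReferenceIoBuffer would return true.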



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)