Posted to issues@impala.apache.org by "Tim Armstrong (JIRA)" <ji...@apache.org> on 2017/05/22 14:53:04 UTC

[jira] [Resolved] (IMPALA-5304) Parquet scanner transfers decompression buffers when not needed

     [ https://issues.apache.org/jira/browse/IMPALA-5304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Armstrong resolved IMPALA-5304.
-----------------------------------
       Resolution: Fixed
    Fix Version/s: Impala 2.9.0


IMPALA-5304: reduce transfer of Parquet decompression buffers

The buffers contain the Parquet DataPages, which need to be
attached to the row batch if the rows point to var-len data
stored directly in the page. Otherwise the buffers can be
discarded once the values in the page have been materialized.

Transferring memory between threads is a known TCMalloc anti-pattern,
so this change reduces the amount of memory handed across threads. It
also allows us to free memory earlier, which may help reduce memory
consumption slightly.
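
The rule described above can be sketched roughly as follows. This is an
illustrative toy, not the actual patch: PageInfo, ToyPool,
RowsReferencePageBuffer() and FinishDataPage() are hypothetical names
invented for this example, and the real scanner works with Impala's own
MemPool and column reader classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration only; not Impala's real types.
enum class Encoding { PLAIN, DICTIONARY };

struct PageInfo {
  bool column_is_var_len;  // e.g. a string column
  Encoding encoding;       // encoding of the current data page
};

// Rows can only point into the decompressed page when the column is
// var-len and the values are stored directly in the page, i.e. the page
// is not dictionary-encoded.
bool RowsReferencePageBuffer(const PageInfo& page) {
  return page.column_is_var_len && page.encoding != Encoding::DICTIONARY;
}

// Toy pool that only tracks the sizes of the buffers it owns.
class ToyPool {
 public:
  void Allocate(size_t bytes) { chunks_.push_back(bytes); }

  // Take ownership of every buffer in 'src', leaving 'src' empty.
  void AcquireData(ToyPool* src) {
    assert(src != nullptr && src != this);
    chunks_.insert(chunks_.end(), src->chunks_.begin(), src->chunks_.end());
    src->chunks_.clear();
  }

  void FreeAll() { chunks_.clear(); }

  size_t total_bytes() const {
    size_t total = 0;
    for (size_t c : chunks_) total += c;
    return total;
  }

 private:
  std::vector<size_t> chunks_;
};

// Called when the reader is done with a data page: attach the buffer to
// the scratch batch only if rows may still point into it, otherwise
// free it immediately.
void FinishDataPage(const PageInfo& page, ToyPool* decompressed_pool,
                    ToyPool* scratch_pool) {
  if (RowsReferencePageBuffer(page)) {
    scratch_pool->AcquireData(decompressed_pool);
  } else {
    decompressed_pool->FreeAll();
  }
}
```

In this sketch a plain-encoded var-len page hands its buffer to the
scratch pool, while a dictionary-encoded page or a fixed-length column
releases the buffer as soon as the page has been materialized.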

Also fix a latent bug I noticed where needs_conversion_ is not
always initialised in the constructor.

Testing
Ran exhaustive build. Most of the Parquet tests use compressed Parquet,
which should exercise this code path.

Change-Id: I2dbd749f43078b222ff8e1ddcec840986c466de6
Reviewed-on: http://gerrit.cloudera.org:8080/6876
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Tested-by: Impala Public Jenkins
---

> Parquet scanner transfers decompression buffers when not needed
> ---------------------------------------------------------------
>
>                 Key: IMPALA-5304
>                 URL: https://issues.apache.org/jira/browse/IMPALA-5304
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.9.0
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>              Labels: perf, resource-management
>             Fix For: Impala 2.9.0
>
>
> The Parquet scanner always transfers decompression buffers to the scratch batch:
> {code}
> Status BaseScalarColumnReader::ReadDataPage() {
>   // We're about to move to the next data page.  The previous data page is
>   // now complete, pass along the memory allocated for it.
>   parent_->scratch_batch_->mem_pool()->AcquireData(decompressed_data_pool_.get(), false);
> {code}
> These in turn are passed along with the row batch. This is safe but unnecessary in many cases where the batch does not hold pointers into the decompression buffer: if the column has only fixed-length data, or if the data page is dictionary-encoded.
> This can make problems like IMPALA-4923 worse than they would be otherwise because extra data is transferred across threads.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)