Posted to issues-all@impala.apache.org by "Quanlong Huang (JIRA)" <ji...@apache.org> on 2019/06/14 02:18:00 UTC

[jira] [Updated] (IMPALA-7294) TABLESAMPLE clause allocates arrays based on total file count instead of selected partitions

     [ https://issues.apache.org/jira/browse/IMPALA-7294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Quanlong Huang updated IMPALA-7294:
-----------------------------------
    Fix Version/s: Impala 2.13.0

> TABLESAMPLE clause allocates arrays based on total file count instead of selected partitions
> --------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-7294
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7294
>             Project: IMPALA
>          Issue Type: Bug
>    Affects Versions: Impala 3.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Minor
>             Fix For: Impala 2.13.0, Impala 3.1.0
>
>
> The HdfsTable.getFilesSample function takes a list of input partitions to sample files from, but when it allocates the array to sample into, it sizes that array by the total file count across all partitions of the table. The resulting array is unnecessarily large and expensive to allocate (it may cause a full GC when the heap is fragmented). The code claims this is an optimization:
> {code}
>     // Use max size to avoid looping over inputParts for the exact size.
> {code}
> ...but I think the cost of the loop over inputParts is likely to be trivial here, since we loop over them anyway later in the function, so they will be pulled into CPU cache either way. Sizing the array from inputParts is also necessary for fine-grained metadata loading in the impalad: for a large table with many partitions, we don't want to load the file lists of all partitions just to tablesample from one partition.
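
The sizing change described above can be sketched as follows. This is a minimal illustration, not Impala's actual code: the Partition class, its numFiles field, and the exactFileCount helper are hypothetical names introduced only for this example. It shows one extra pass over the selected partitions producing an exact array size, compared with a table-wide count.

```java
import java.util.Arrays;
import java.util.List;

final class SampleSizing {
  /** Hypothetical stand-in for a partition; only its file count matters here. */
  static final class Partition {
    final String name;
    final int numFiles;

    Partition(String name, int numFiles) {
      this.name = name;
      this.numFiles = numFiles;
    }
  }

  /**
   * Sums the file counts of the given partitions. Math.toIntExact throws
   * ArithmeticException if the sum does not fit in an int array length.
   */
  static int exactFileCount(List<Partition> parts) {
    long total = 0;
    for (Partition p : parts) {
      total += p.numFiles;
    }
    return Math.toIntExact(total);
  }

  public static void main(String[] args) {
    List<Partition> allParts = Arrays.asList(
        new Partition("p1", 1000),
        new Partition("p2", 3),
        new Partition("p3", 5000));
    // Assume the query samples only from the second partition.
    List<Partition> inputParts = allParts.subList(1, 2);

    // Sizing from the whole table, as the issue describes.
    int tableWide = exactFileCount(allParts);
    // Sizing from the selected partitions only.
    int selectedOnly = exactFileCount(inputParts);

    int[] sampleSlots = new int[selectedOnly];
    System.out.println("table-wide size: " + tableWide);
    System.out.println("selected size: " + sampleSlots.length);
  }
}
```

Under these assumptions the extra loop touches only the partitions that were passed in, so it never needs the file lists of partitions outside the selection.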


