Posted to issues@tajo.apache.org by "Jaehwa Jung (JIRA)" <ji...@apache.org> on 2015/12/02 10:12:10 UTC

[jira] [Updated] (TAJO-1974) When calculating partitioned table volume, avoid to list partition directories.

     [ https://issues.apache.org/jira/browse/TAJO-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jaehwa Jung updated TAJO-1974:
------------------------------
    Fix Version/s:     (was: 0.11.1)

> When calculating partitioned table volume, avoid to list partition directories.
> -------------------------------------------------------------------------------
>
>                 Key: TAJO-1974
>                 URL: https://issues.apache.org/jira/browse/TAJO-1974
>             Project: Tajo
>          Issue Type: Improvement
>          Components: Physical Operator, QueryMaster
>    Affects Versions: 0.12.0
>            Reporter: Jaehwa Jung
>            Assignee: Jaehwa Jung
>             Fix For: 0.12.0
>
>
> Currently, after storing the data of a partitioned table, Tajo calculates the volume of the table by listing its partition directories. To list the directories, Tajo uses FileSystem::getContentSummary from the generic HDFS APIs.
> For small to medium-size partition directories, this should not be a problem, but for large ones it can be. For example, three years of data organized into hourly directories results in 26,280 directories (3 x 365 x 24). If each directory contains 5 files, that makes a grand total of 131,400 files. This is a moderate load for HDFS, but it might result in very poor performance on S3. Thus we need to avoid listing partition directories.
> I think we can get the volume of each partition directory in PhysicalOperator. If every task sets the volume of the partitions it writes, Query doesn’t need to list partition directories using the HDFS API.
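
The proposal above could look roughly like the following. This is only a minimal sketch in plain Java, not Tajo code: the class PartitionVolumeSketch, the PartitionStat type, and both method names are hypothetical, and it assumes each task can report the number of bytes it wrote per partition directory. The per-task reports are merged in memory, so no directory listing (and no getContentSummary call) is needed to obtain the table volume.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names exist in Tajo.
class PartitionVolumeSketch {

    // Assumed per-task report: bytes one task wrote into one partition directory.
    static final class PartitionStat {
        final String partitionPath;
        final long bytes;

        PartitionStat(String partitionPath, long bytes) {
            this.partitionPath = partitionPath;
            this.bytes = bytes;
        }
    }

    // Sum the reported bytes per partition; several tasks may write to the same partition.
    static Map<String, Long> mergeByPartition(List<PartitionStat> stats) {
        Map<String, Long> volumes = new HashMap<>();
        for (PartitionStat stat : stats) {
            volumes.merge(stat.partitionPath, stat.bytes, Long::sum);
        }
        return volumes;
    }

    // Table volume is the sum over all partitions, computed without touching the file system.
    static long tableVolume(Map<String, Long> volumes) {
        long total = 0L;
        for (long bytes : volumes.values()) {
            total += bytes;
        }
        return total;
    }

    public static void main(String[] args) {
        List<PartitionStat> reports = Arrays.asList(
            new PartitionStat("dt=2015120100", 100L),
            new PartitionStat("dt=2015120100", 50L),
            new PartitionStat("dt=2015120101", 30L));
        Map<String, Long> volumes = mergeByPartition(reports);
        System.out.println(volumes + " total=" + tableVolume(volumes));
    }
}
```

With this shape, the cost of computing the volume depends on the number of task reports rather than on the number of directories and files in the underlying store, which is the point of the proposal for S3.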


