Posted to issues@tajo.apache.org by "Jaehwa Jung (JIRA)" <ji...@apache.org> on 2015/11/12 08:58:10 UTC

[jira] [Commented] (TAJO-1974) When calculating partitioned table volume, avoid to list partition directories.

    [ https://issues.apache.org/jira/browse/TAJO-1974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001814#comment-15001814 ] 

Jaehwa Jung commented on TAJO-1974:
-----------------------------------

Hi folks,

I tried to calculate the partitioned table volume using the bytes field of TableStats in PhysicalOperator. But I found that those bytes can occasionally be wrong. In my test cases, when compression was applied to the table, the volume obtained by listing directories differed from the sum of the TableStats volumes. You can reproduce this situation in TestTablePartitions with the following code:
https://github.com/blrunner/tajo/commit/a90e7272b4ee22abf384b7f85e6835f075ca58c3
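
The divergence under compression is what you would expect if the writer counts bytes before the codec runs. Here is a small stdlib-only sketch (not Tajo code; class and method names are illustrative) showing a pre-compression counter disagreeing with the bytes that actually land on disk:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;

public class StatsVsDiskSize {
    // Returns {bytes counted by the writer, bytes actually written after compression}.
    static long[] measure() throws IOException {
        byte[] row = "2015-11-12,some,highly,repetitive,row\n".getBytes();
        long statsBytes = 0;                           // per-task counter, pre-compression
        ByteArrayOutputStream disk = new ByteArrayOutputStream();
        try (OutputStream out = new GZIPOutputStream(disk)) {
            for (int i = 0; i < 10_000; i++) {
                out.write(row);
                statsBytes += row.length;              // counted before the codec sees the data
            }
        }
        return new long[] { statsBytes, disk.size() };
    }

    public static void main(String[] args) throws IOException {
        long[] r = measure();
        System.out.println("stats bytes: " + r[0]);
        System.out.println("disk bytes:  " + r[1]);   // much smaller: repetitive data compresses well
    }
}
```

If TableStats is populated from a counter like statsBytes, it will never match what getContentSummary reports for the compressed files.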

As a result, I think we can't avoid listing the partition directories with the HDFS API. But it could be improved by moving the API call into PhysicalOperator or another class. Could you give me any advice on resolving this issue?
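
For reference, what FileSystem::getContentSummary has to do for a partitioned table is roughly the following stdlib sketch (hypothetical layout and paths): walk every partition directory and sum file lengths. Each directory visited is at least one listing call, which is why large partition counts hurt, especially on S3 where each call has higher latency:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class PartitionVolume {
    // Recursively sum the sizes of all files under the table directory,
    // analogous to FileSystem::getContentSummary(path).getLength() in HDFS.
    static long volumeByListing(Path tableDir) throws IOException {
        try (Stream<Path> files = Files.walk(tableDir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> p.toFile().length())
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny fake hourly-partition layout: tbl/dt=.../part-0
        Path tableDir = Files.createTempDirectory("tbl");
        for (int h = 0; h < 3; h++) {
            Path part = Files.createDirectories(tableDir.resolve("dt=2015111200" + h));
            Files.write(part.resolve("part-0"), new byte[100]);
        }
        System.out.println(volumeByListing(tableDir)); // 300
    }
}
```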

> When calculating partitioned table volume, avoid to list partition directories.
> -------------------------------------------------------------------------------
>
>                 Key: TAJO-1974
>                 URL: https://issues.apache.org/jira/browse/TAJO-1974
>             Project: Tajo
>          Issue Type: Improvement
>          Components: Physical Operator, QueryMaster
>    Affects Versions: 0.12.0
>            Reporter: Jaehwa Jung
>            Assignee: Jaehwa Jung
>             Fix For: 0.12.0
>
>
> Currently, after storing the data of a partitioned table, Tajo calculates the table volume by listing the partition directories. To list the directories, Tajo uses FileSystem::getContentSummary from the generic HDFS API.
> For a small to medium number of partition directories, this is not a problem. But for a large number of partition directories, it is. For example, three years of data organized into hourly directories results in 26,280 directories. If each directory contains 5 files, that makes a grand total of 131,400 files. This is manageable in HDFS, but it can result in very poor performance on S3. Thus we need to avoid listing partition directories.
> I think we can get the volume of each partition directory in the PhysicalOperator. If every task sets the volume of its partition, the Query doesn't need to list partition directories via the HDFS API.
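
The proposal above could be sketched as follows (a hedged illustration only; the class and method names are hypothetical, not Tajo's actual API): each task reports the bytes it wrote per partition while writing, and the query side sums those reports instead of listing directories afterwards, so no filesystem metadata calls are needed at the end.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PartitionStatsAggregator {
    private final Map<String, Long> bytesByPartition = new ConcurrentHashMap<>();

    // Called by a task when it finishes writing output for a partition.
    public void report(String partitionPath, long bytesWritten) {
        bytesByPartition.merge(partitionPath, bytesWritten, Long::sum);
    }

    // Called once on the query side to obtain the table volume,
    // replacing the post-hoc getContentSummary listing.
    public long totalVolume() {
        return bytesByPartition.values().stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        PartitionStatsAggregator agg = new PartitionStatsAggregator();
        agg.report("dt=2015111208", 1_000L);  // task 1
        agg.report("dt=2015111209", 2_500L);  // task 2
        agg.report("dt=2015111208", 500L);    // task 3 writes to the same partition
        System.out.println(agg.totalVolume()); // 4000
    }
}
```

Note that, per the comment above, such per-task counters must measure post-compression bytes (or be reconciled with on-disk sizes) for the total to match what listing would report.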



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)