Posted to issues@tajo.apache.org by "Jaehwa Jung (JIRA)" <ji...@apache.org> on 2015/11/12 02:53:11 UTC

[jira] [Created] (TAJO-1974) When calculating partitioned table volume, avoid listing partition directories.

Jaehwa Jung created TAJO-1974:
---------------------------------

             Summary: When calculating partitioned table volume, avoid listing partition directories.
                 Key: TAJO-1974
                 URL: https://issues.apache.org/jira/browse/TAJO-1974
             Project: Tajo
          Issue Type: Improvement
          Components: Physical Operator, QueryMaster
    Affects Versions: 0.12.0
            Reporter: Jaehwa Jung
            Assignee: Jaehwa Jung
             Fix For: 0.12.0


Currently, after storing the data of a partitioned table, Tajo calculates the volume of the table by listing its partition directories. To list the directories, Tajo uses FileSystem::getContentSummary from the generic HDFS APIs.

For a small or medium number of partition directories, this is not a problem, but for a large number of partition directories it becomes one. For example, three years of data organized into hourly directories results in 26,280 directories. If each directory contains 5 files, this makes a grand total of 131,400 files. That is a moderate load on HDFS, but it may result in very poor performance on S3. Thus we need to avoid listing partition directories.
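
The figures above can be checked with a few lines of Java. This is only an arithmetic sketch: the class and method names are made up for illustration, and a year is assumed to have 365 days.

```java
public class PartitionCountSketch {
    // Number of hourly partition directories for the given number of
    // 365-day years (leap days are ignored, as in the example above).
    static long hourlyDirectories(int years) {
        return (long) years * 365 * 24;
    }

    // Total number of files when every directory holds the same number of files.
    static long totalFiles(long directories, int filesPerDirectory) {
        return directories * filesPerDirectory;
    }

    public static void main(String[] args) {
        long dirs = hourlyDirectories(3);
        long files = totalFiles(dirs, 5);
        // Prints: 26280 directories, 131400 files
        System.out.println(dirs + " directories, " + files + " files");
    }
}
```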

I think we can get the volume of each partition directory in PhysicalOperator, since each task already knows how much data it writes. If all tasks report the volume of the partitions they wrote, Query doesn’t need to list the partition directories using the HDFS API.
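
A minimal sketch of this idea, assuming each task can record the number of bytes it writes per partition directory. None of the names below (PartitionVolumeSketch, TaskReport, recordWrite, mergeReports, tableVolume) are Tajo classes or methods; they are hypothetical and only illustrate summing task-side counters instead of listing directories.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionVolumeSketch {
    /** Hypothetical per-task report: bytes written to each partition directory. */
    public static final class TaskReport {
        final Map<String, Long> bytesByPartition = new HashMap<>();

        // Called by the (hypothetical) store operator whenever it flushes data.
        public void recordWrite(String partitionPath, long bytes) {
            bytesByPartition.merge(partitionPath, bytes, Long::sum);
        }
    }

    // Combine the reports of all finished tasks into one volume per partition.
    public static Map<String, Long> mergeReports(List<TaskReport> reports) {
        Map<String, Long> merged = new HashMap<>();
        for (TaskReport report : reports) {
            report.bytesByPartition.forEach(
                (path, bytes) -> merged.merge(path, bytes, Long::sum));
        }
        return merged;
    }

    // Table volume is the sum over all partitions; no file system call is made.
    public static long tableVolume(Map<String, Long> volumeByPartition) {
        long total = 0L;
        for (long bytes : volumeByPartition.values()) {
            total += bytes;
        }
        return total;
    }

    public static void main(String[] args) {
        TaskReport task1 = new TaskReport();
        task1.recordWrite("dt=2015-11-12/hr=00", 100L);
        task1.recordWrite("dt=2015-11-12/hr=01", 200L);

        TaskReport task2 = new TaskReport();
        task2.recordWrite("dt=2015-11-12/hr=01", 300L);

        Map<String, Long> merged = mergeReports(List.of(task1, task2));
        System.out.println("hr=01 bytes: " + merged.get("dt=2015-11-12/hr=01"));
        System.out.println("table bytes: " + tableVolume(merged));
    }
}
```

With this approach the cost of computing the table volume depends on the number of task reports rather than on the number of directories and files in the file system, which is the property that matters for S3.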



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)