Posted to issues@tajo.apache.org by "Jaehwa Jung (JIRA)" <ji...@apache.org> on 2015/11/12 02:53:11 UTC
[jira] [Created] (TAJO-1974) When calculating partitioned table
volume, avoid to list partition directories.
Jaehwa Jung created TAJO-1974:
---------------------------------
Summary: When calculating partitioned table volume, avoid to list partition directories.
Key: TAJO-1974
URL: https://issues.apache.org/jira/browse/TAJO-1974
Project: Tajo
Issue Type: Improvement
Components: Physical Operator, QueryMaster
Affects Versions: 0.12.0
Reporter: Jaehwa Jung
Assignee: Jaehwa Jung
Fix For: 0.12.0
Currently, after storing the data of a partitioned table, Tajo calculates the volume of the table by listing its partition directories. To list the directories, Tajo uses FileSystem::getContentSummary from the generic HDFS API.
For a table with a small or medium number of partition directories, this is not a problem. For a table with a large number of partition directories, it is. For example, three years of data organized into hourly directories results in 26,280 directories. If each directory contains 5 files, that makes a grand total of 131,400 files. This is a moderate load for HDFS, but it might result in very poor performance on S3. Thus we need to avoid listing partition directories.
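The figures in this example can be checked with a few lines of Java. This is only a back-of-the-envelope sketch: the class and method names are made up for illustration, and leap days are ignored, which matches the numbers quoted in the issue.

```java
// Hypothetical helper, not part of Tajo: reproduces the directory and file
// counts used in the example (3 years of hourly partitions, 5 files each).
final class PartitionCountEstimate {

    // Number of hourly partition directories for the given number of years.
    // Assumes 365 days per year (leap days ignored).
    static long hourlyDirectories(int years) {
        return (long) years * 365 * 24;
    }

    // Total number of files when every directory holds the same file count.
    static long totalFiles(long directories, int filesPerDirectory) {
        return directories * filesPerDirectory;
    }

    public static void main(String[] args) {
        long dirs = hourlyDirectories(3);
        long files = totalFiles(dirs, 5);
        System.out.println(dirs + " directories, " + files + " files");
    }
}
```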
I think we can get the volume of each partition directory in PhysicalOperator. If every task sets the volume of the partitions it writes, Query doesn’t need to list partition directories using the HDFS API.
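A minimal sketch of the proposed idea, assuming each task can report how many bytes it wrote to each partition. The class PartitionVolumeAggregator and its methods are hypothetical names for illustration only; they are not Tajo's actual classes, and the real change may be structured differently.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical aggregator, not Tajo's API: the query side sums the byte
// counts reported by tasks instead of listing partition directories.
final class PartitionVolumeAggregator {

    // Bytes written per partition path, accumulated across all task reports.
    private final Map<String, Long> bytesByPartition = new ConcurrentHashMap<>();

    // Called for each (partition, bytes) pair a task reports.
    // Several tasks may write into the same partition, so values are summed.
    void report(String partitionPath, long writtenBytes) {
        bytesByPartition.merge(partitionPath, writtenBytes, Long::sum);
    }

    // Volume of a single partition; 0 if no task reported it.
    long partitionVolume(String partitionPath) {
        return bytesByPartition.getOrDefault(partitionPath, 0L);
    }

    // Volume of the whole table, computed without any file system call.
    long tableVolume() {
        long total = 0L;
        for (long bytes : bytesByPartition.values()) {
            total += bytes;
        }
        return total;
    }

    public static void main(String[] args) {
        PartitionVolumeAggregator agg = new PartitionVolumeAggregator();
        agg.report("dt=2015-11-12/hr=01", 1_000L); // first task
        agg.report("dt=2015-11-12/hr=01", 500L);   // second task, same partition
        agg.report("dt=2015-11-12/hr=02", 2_000L);
        System.out.println("table volume: " + agg.tableVolume() + " bytes");
    }
}
```

With this shape the cost of computing the table volume depends on the number of task reports rather than on the number of directories and files in the underlying store, which is the point of the proposal for S3.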