Posted to issues@tajo.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/02/12 06:41:18 UTC

[jira] [Commented] (TAJO-2030) Use list S3 files using AmazonS3Client instead of using S3A

    [ https://issues.apache.org/jira/browse/TAJO-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144088#comment-15144088 ] 

ASF GitHub Bot commented on TAJO-2030:
--------------------------------------

Github user blrunner commented on the pull request:

    https://github.com/apache/tajo/pull/932#issuecomment-183187420
  
    This patch depends on hadoop-aws. I'm going to implement it afresh after resolving https://github.com/apache/tajo/pull/953. 


> Use list S3 files using AmazonS3Client instead of using S3A
> -----------------------------------------------------------
>
>                 Key: TAJO-2030
>                 URL: https://issues.apache.org/jira/browse/TAJO-2030
>             Project: Tajo
>          Issue Type: Sub-task
>          Components: S3
>            Reporter: Jaehwa Jung
>            Assignee: Jaehwa Jung
>             Fix For: 0.12.0
>
>
> AWS S3 provides a bulk listing API. It takes the common prefix of all input paths as a parameter and returns all objects whose keys start with that prefix, in blocks of 1000.
> If we use AmazonS3Client to list S3 files instead of S3A, listing performance should improve. To test this idea, I used PrestoFileSystem in place of S3AFileSystem. For partition-pruning queries, PrestoFileSystem was much faster than S3AFileSystem.
> Here are my benchmark results for the following queries:
> {code}
> 1 partition : select count(*) from lineitem where l_shipdate = '1992-01-02';
> 30 partitions: select count(*) from lineitem where l_shipdate > '1992-01-01' and l_shipdate < '1992-02-01';
> 90 partitions: select count(*) from lineitem where l_shipdate >= '1992-01-01' and l_shipdate < '1992-04-01';
> 151 partitions: select count(*) from lineitem where l_shipdate >= '1992-01-01' and l_shipdate < '1992-06-01';
> {code}
> ||# of partitions||PrestoFileSystem (ms)||S3AFileSystem (ms)||
> |1|677|800|
> |30|2753|6977|
> |90|6825|13772|
> |151|13834|25701|
> For reference, I used the TPC-H 1 GB dataset and partitioned the {{lineitem}} table by the {{l_shipdate}} column.
> I think there are two ways to resolve this:
> - Borrow PrestoFileSystem and the related code from Presto
> - Implement the necessary code in S3TableSpace, using Presto as a reference
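
The bulk-listing idea in the description above can be sketched roughly as follows. This is only an in-memory simulation written in Python for brevity. It is not the AWS SDK, not Presto's code, and not Tajo's implementation. The function names (common_prefix, list_pages, files_for_partitions) and the sample key layout are hypothetical, and the page size of 1000 is taken from the description. The point is that one prefix-scoped, paginated listing can replace a separate listing call per partition directory.

```python
import os
from bisect import bisect_left
from typing import Iterator, List, Sequence, Tuple

PAGE_SIZE = 1000  # block size mentioned in the issue description


def common_prefix(paths: Sequence[str]) -> str:
    """Longest string prefix shared by every input path."""
    return os.path.commonprefix(list(paths))


def list_pages(sorted_keys: List[str], prefix: str,
               page_size: int = PAGE_SIZE) -> Iterator[List[str]]:
    """Hypothetical stand-in for a paginated bulk listing.

    Yields pages of at most page_size keys that start with prefix.
    sorted_keys must be in lexicographic order.
    """
    start = bisect_left(sorted_keys, prefix)
    page: List[str] = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        page.append(key)
        if len(page) == page_size:
            yield page
            page = []
    if page:
        yield page


def files_for_partitions(sorted_keys: List[str],
                         partition_dirs: Sequence[str]) -> Tuple[List[str], int]:
    """List once under the common prefix, then keep only the wanted partitions.

    Returns the matching keys and the number of simulated listing requests.
    """
    prefix = common_prefix(partition_dirs)
    wanted = tuple(d.rstrip("/") + "/" for d in partition_dirs)
    matched: List[str] = []
    requests = 0
    for page in list_pages(sorted_keys, prefix):
        requests += 1
        matched.extend(k for k in page if k.startswith(wanted))
    return matched, requests


if __name__ == "__main__":
    # Made-up key layout: 56 partition directories with 20 files each.
    keys = sorted(
        f"lineitem/l_shipdate=1992-{m:02d}-{d:02d}/part-{i}"
        for m in (1, 2) for d in range(1, 29) for i in range(20)
    )
    dirs = ["lineitem/l_shipdate=1992-01-02", "lineitem/l_shipdate=1992-01-03"]
    files, requests = files_for_partitions(keys, dirs)
    print(len(files), "files found with", requests, "listing request(s)")
```

In this simulation the number of listing requests grows with the number of keys under the common prefix rather than with the number of partitions. That is the property the benchmark above is meant to exploit. A real implementation would also have to deal with credentials, request errors, and common prefixes that are much broader than the selected partitions.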



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)