Posted to issues@tajo.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/01/07 07:37:39 UTC
[jira] [Commented] (TAJO-2030) Use list S3 files using AmazonS3Client instead of using S3A

    [ https://issues.apache.org/jira/browse/TAJO-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086919#comment-15086919 ] 

ASF GitHub Bot commented on TAJO-2030:
--------------------------------------

GitHub user blrunner opened a pull request:

    https://github.com/apache/tajo/pull/932

    TAJO-2030: Use list S3 files using AmazonS3Client instead of using S3A.

    S3 bulk listing is implemented entirely in ``TajoS3FileSystem``, which is heavily based on ``PrestoS3FileSystem``. ``TajoS3FileSystem`` extends ``S3AFileSystem`` because ``PrestoS3FileSystem`` does not support some of the methods needed for writing files, for example ``FileSystem::mkdir``.
    
    Here are my benchmark results.
    
    # Configuration
    
    * EC2 instance type : c3.xlarge
    * Tajo version : 0.12.0-SNAPSHOT
    * Cluster: 1 master, 1 worker
    * Partitions : generated by Hive
    
    
    # Queries
    
    ```
    1 partition: select count(*) from lineitem where l_shipdate = '1992-01-02';
    30 partitions: select count(*) from lineitem  where l_shipdate > '1992-01-01' and l_shipdate < '1992-02-01';
    90 partitions: select count(*) from lineitem  where l_shipdate >= '1992-01-01' and l_shipdate < '1992-04-01';
    151 partitions: select count(*) from lineitem where l_shipdate >= '1992-01-01' and l_shipdate < '1992-06-01';
    334 partitions: select count(*) from lineitem where l_shipdate >= '1992-01-01' and l_shipdate < '1992-12-01';
    ```
    
    # Results : Partition Pruning
    
    Number of partitions | S3AFileSystem | TajoS3FileSystem | Improvement
    -------------------|----------------------|--------------------------|-------------------
    1 | 1088 ms | 607 ms | 1.79x
    30 | 5421 ms | 3414 ms |  1.58x
    90 | 15776 ms | 7927 ms | 1.99x
    151 | 24060 ms | 14912 ms | 1.61x
    334 | 45397 ms | 32247 ms | 1.40x
    
    # Results : Query Finish Time
    
    Number of partitions | S3AFileSystem | TajoS3FileSystem | Improvement
    -------------------|----------------------|--------------------------|-------------------
    1 | 3.99 sec |  2.726 sec | 1.46x
    30 | 15.447 sec | 12.416 sec | 1.24x
    90 | 40.153 sec | 31.593 sec | 1.27x
    151 | 66.038 sec | 44.604 sec | 1.48x
    334 | 137.137 sec | 90.419 sec | 1.51x
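
The Improvement column in both tables appears to be the S3AFileSystem time divided by the TajoS3FileSystem time, truncated (not rounded) to two decimals. That rule is inferred from the numbers and is not stated in the pull request. A minimal sketch of the arithmetic, where `SpeedupSketch` and `truncatedSpeedup` are hypothetical names used only for illustration:

```java
/** Illustrative only: recomputes the "Improvement" column from the reported timings. */
class SpeedupSketch {

    /** Baseline time divided by new time, truncated (not rounded) to two decimals. */
    static double truncatedSpeedup(double baseline, double improved) {
        return Math.floor(baseline / improved * 100.0) / 100.0;
    }

    public static void main(String[] args) {
        // Partition pruning times in ms, copied from the table: {S3AFileSystem, TajoS3FileSystem}
        double[][] rows = {
            {1088, 607}, {5421, 3414}, {15776, 7927}, {24060, 14912}, {45397, 32247}
        };
        for (double[] row : rows) {
            System.out.printf("%.2fx%n", truncatedSpeedup(row[0], row[1]));
        }
    }
}
```

With plain rounding, the 30-partition row would read 1.59x rather than 1.58x, which is why truncation is assumed here.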

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/blrunner/tajo TAJO-2030

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tajo/pull/932.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #932
    
----
commit 1a90ad1688e763b7a1c52bdf7600c206094a10b6
Author: JaeHwa Jung <bl...@apache.org>
Date:   2016-01-07T05:10:07Z

    TAJO-2030: Use list S3 files using AmazonS3Client instead of using S3A.

----


> Use list S3 files using AmazonS3Client instead of using S3A
> -----------------------------------------------------------
>
>                 Key: TAJO-2030
>                 URL: https://issues.apache.org/jira/browse/TAJO-2030
>             Project: Tajo
>          Issue Type: Sub-task
>          Components: S3
>            Reporter: Jaehwa Jung
>            Assignee: Jaehwa Jung
>             Fix For: 0.12.0
>
>
> AWS S3 provides a bulk listing API. It takes the common prefix of all input paths as a parameter and returns every object whose key starts with that prefix, in blocks of 1000.
> Using AmazonS3Client instead of S3A to list S3 files should therefore improve performance. To test this idea, I used PrestoFileSystem in place of S3AFileSystem. When pruning partition filters, PrestoFileSystem was much faster than S3AFileSystem.
> Here are my benchmark results for the following queries:
> {code}
> 1 partition : select count(*) from lineitem where l_shipdate = '1992-01-02';
> 30 partitions: select count(*) from lineitem  where l_shipdate > '1992-01-01' and l_shipdate < '1992-02-01';
> 90 partitions: select count(*) from lineitem  where l_shipdate >= '1992-01-01' and l_shipdate < '1992-04-01';
> 151 partitions: select count(*) from lineitem where l_shipdate >= '1992-01-01' and l_shipdate < '1992-06-01';
> {code}
> || (#) of partitions||PrestoFileSystem(ms)||S3AFileSystem(ms)||
> |1|677|800|
> |30|2753|6977|
> |90|6825|13772|
> |151|13834|25701|
> For reference, I used the TPC-H 1 GB dataset and set the {{l_shipdate}} column of the {{lineitem}} table as the partition column.
> I think there are two ways to resolve this:
> - Borrow PrestoFileSystem and the related code from Presto
> - Implement the necessary code in S3TableSpace, using Presto as a reference
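
The bulk listing idea in the issue description can be sketched as follows. This is a minimal in-memory simulation, not the Tajo or Presto implementation, and it does not call the AWS SDK: a sorted set stands in for the bucket, and `BulkListingSketch`, `commonPrefix`, `listPage`, `listAll`, and `sampleBucket` are hypothetical names. It shows the two steps described above: compute the common prefix of the input paths, then fetch every key under that prefix in pages of 1000.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** In-memory stand-in for an S3 bucket; illustrates prefix listing in pages. */
class BulkListingSketch {
    static final int PAGE_SIZE = 1000; // block size named in the issue description

    /** Longest common string prefix of all input paths. */
    static String commonPrefix(List<String> paths) {
        if (paths.isEmpty()) return "";
        String prefix = paths.get(0);
        for (String path : paths) {
            int i = 0;
            int max = Math.min(prefix.length(), path.length());
            while (i < max && prefix.charAt(i) == path.charAt(i)) i++;
            prefix = prefix.substring(0, i);
        }
        return prefix;
    }

    /** One page of keys that start with prefix and sort strictly after marker (null = first page). */
    static List<String> listPage(SortedSet<String> bucket, String prefix, String marker, int pageSize) {
        List<String> page = new ArrayList<>();
        for (String key : bucket.tailSet(marker == null ? prefix : marker)) {
            if (key.equals(marker)) continue;   // marker itself was returned by the previous page
            if (!key.startsWith(prefix)) break; // keys are sorted, so the prefix range has ended
            page.add(key);
            if (page.size() == pageSize) break;
        }
        return page;
    }

    /** Collects all keys under prefix page by page; requests[0] counts simulated round trips. */
    static List<String> listAll(SortedSet<String> bucket, String prefix, int[] requests) {
        List<String> all = new ArrayList<>();
        String marker = null;
        while (true) {
            List<String> page = listPage(bucket, prefix, marker, PAGE_SIZE);
            requests[0]++;
            all.addAll(page);
            if (page.size() < PAGE_SIZE) return all; // a short page means nothing is left
            marker = page.get(page.size() - 1);
        }
    }

    /** Hypothetical data: 25 partitions of 100 files each, plus one unrelated key. */
    static SortedSet<String> sampleBucket() {
        SortedSet<String> bucket = new TreeSet<>();
        for (int day = 1; day <= 25; day++) {
            for (int file = 0; file < 100; file++) {
                bucket.add(String.format("lineitem/l_shipdate=1992-01-%02d/part-%03d", day, file));
            }
        }
        bucket.add("orders/part-000");
        return bucket;
    }

    public static void main(String[] args) {
        List<String> partitions = new ArrayList<>();
        for (int day = 1; day <= 25; day++) {
            partitions.add(String.format("lineitem/l_shipdate=1992-01-%02d/", day));
        }
        int[] requests = {0};
        String prefix = commonPrefix(partitions);
        List<String> keys = listAll(sampleBucket(), prefix, requests);
        System.out.println(prefix + " -> " + keys.size() + " keys in " + requests[0] + " requests");
    }
}
```

The point of the sketch is the request count: 25 partitions are covered by a handful of paged requests rather than at least one listing call per partition directory. In the AWS SDK for Java 1.x, the corresponding call is `AmazonS3Client.listObjects` with a prefix, where `ObjectListing.isTruncated()` signals that more pages remain.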



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)