Posted to issues@tajo.apache.org by blrunner <gi...@git.apache.org> on 2016/01/07 07:37:29 UTC

[GitHub] tajo pull request: TAJO-2030: Use list S3 files using AmazonS3Clie...

GitHub user blrunner opened a pull request:

    https://github.com/apache/tajo/pull/932

    TAJO-2030: Use list S3 files using AmazonS3Client instead of using S3A.

    The S3 bulk listing is implemented entirely in ``TajoS3FileSystem``, which is heavily based on ``PrestoS3FileSystem``. ``TajoS3FileSystem`` extends ``S3AFileSystem`` because ``PrestoS3FileSystem`` doesn't support some of the methods needed for file writing, for example ``FileSystem::mkdir``.
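    
    For readers unfamiliar with the idea, a bulk listing generally pages through every key under a prefix until the store reports no more results. The following is only a minimal sketch of that loop against an in-memory stand-in; it is not the code in this pull request and does not use the AWS SDK. ``FakeObjectStore``, ``BulkLister``, ``listPage`` and ``listAll`` are hypothetical names. With the real ``AmazonS3Client`` the corresponding calls would presumably be ``listObjects`` and ``listNextBatchOfObjects``, looping while ``ObjectListing.isTruncated()`` is true.
    
    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeSet;

    /** Hypothetical in-memory stand-in for an object store; NOT the AWS SDK. */
    class FakeObjectStore {
        private final TreeSet<String> keys = new TreeSet<>();
        int requests = 0; // number of list calls served so far

        void put(String key) { keys.add(key); }

        /** Returns up to maxKeys keys under prefix, strictly after marker (null = from the start). */
        List<String> listPage(String prefix, String marker, int maxKeys) {
            requests++;
            List<String> page = new ArrayList<>();
            String start = (marker == null) ? prefix : marker;
            boolean inclusive = (marker == null);
            for (String k : keys.tailSet(start, inclusive)) {
                // Keys are sorted, so the first non-matching key ends the prefix range.
                if (!k.startsWith(prefix) || page.size() == maxKeys) break;
                page.add(k);
            }
            return page;
        }
    }

    /** Hypothetical helper that follows the paging markers until the listing is complete. */
    class BulkLister {
        static List<String> listAll(FakeObjectStore store, String prefix, int maxKeys) {
            List<String> all = new ArrayList<>();
            String marker = null;
            while (true) {
                List<String> page = store.listPage(prefix, marker, maxKeys);
                all.addAll(page);
                if (page.size() < maxKeys) break;   // a short page means nothing is left
                marker = page.get(page.size() - 1); // resume after the last key seen
            }
            return all;
        }
    }
    ```
    
    In this sketch, listing five keys under one prefix with ``maxKeys = 2`` takes three requests, regardless of how many partition directories the keys are spread across.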
    
    Here are my benchmark results.
    
    # Configuration
    
    * EC2 instance type : c3.xlarge
    * Tajo version : 0.12.0-SNAPSHOT
    * Cluster: 1 master, 1 worker
    * Partitions: generated by Hive
    
    
    # Queries
    
    ```
    1 partition: select count(*) from lineitem where l_shipdate = '1992-01-02';
    30 partitions: select count(*) from lineitem where l_shipdate > '1992-01-01' and l_shipdate < '1992-02-01';
    90 partitions: select count(*) from lineitem where l_shipdate >= '1992-01-01' and l_shipdate < '1992-04-01';
    151 partitions: select count(*) from lineitem where l_shipdate >= '1992-01-01' and l_shipdate < '1992-06-01';
    334 partitions: select count(*) from lineitem where l_shipdate >= '1992-01-01' and l_shipdate < '1992-12-01';
    ```
    
    # Results : Partition Pruning
    
    # of partitions | S3AFileSystem | TajoS3FileSystem | Improvement
    -------------------|----------------------|--------------------------|-------------------
    1 | 1088 ms | 607 ms | 1.79x
    30 | 5421 ms | 3414 ms | 1.58x
    90 | 15776 ms | 7927 ms | 1.99x
    151 | 24060 ms | 14912 ms | 1.61x
    334 | 45397 ms | 32247 ms | 1.40x
    
    # Results : Query Finish Time
    
    # of partitions | S3AFileSystem | TajoS3FileSystem | Improvement
    -------------------|----------------------|--------------------------|-------------------
    1 | 3.99 sec | 2.726 sec | 1.46x
    30 | 15.447 sec | 12.416 sec | 1.24x
    90 | 40.153 sec | 31.593 sec | 1.27x
    151 | 66.038 sec | 44.604 sec | 1.48x
    334 | 137.137 sec | 90.419 sec | 1.51x
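    
    The Improvement column appears to be the S3AFileSystem time divided by the TajoS3FileSystem time, truncated (not rounded) to two decimals; for example, 5421 / 3414 is about 1.587, listed as 1.58x. A small sketch to reproduce the column from the figures above, where the class ``Speedup`` and method ``ratio`` are hypothetical names used only for illustration:
    
    ```java
    import java.math.BigDecimal;
    import java.math.RoundingMode;

    class Speedup {
        /** Returns baseline / improved, truncated to two decimals, with an "x" suffix. */
        static String ratio(String baseline, String improved) {
            BigDecimal q = new BigDecimal(baseline)
                    .divide(new BigDecimal(improved), 2, RoundingMode.DOWN);
            return q.toPlainString() + "x";
        }

        public static void main(String[] args) {
            System.out.println(ratio("1088", "607"));      // partition pruning, 1 partition: 1.79x
            System.out.println(ratio("15.447", "12.416")); // query finish time, 30 partitions: 1.24x
        }
    }
    ```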

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/blrunner/tajo TAJO-2030

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tajo/pull/932.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #932
    
----
commit 1a90ad1688e763b7a1c52bdf7600c206094a10b6
Author: JaeHwa Jung <bl...@apache.org>
Date:   2016-01-07T05:10:07Z

    TAJO-2030: Use list S3 files using AmazonS3Client instead of using S3A.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-2030: Use list S3 files using AmazonS3Clie...

Posted by blrunner <gi...@git.apache.org>.
GitHub user blrunner closed the pull request at:

    https://github.com/apache/tajo/pull/932



[GitHub] tajo pull request: TAJO-2030: Use list S3 files using AmazonS3Clie...

Posted by blrunner <gi...@git.apache.org>.
GitHub user blrunner commented on the pull request:

    https://github.com/apache/tajo/pull/932#issuecomment-183187420
  
    This patch depends on hadoop-aws. I'm going to reimplement it after https://github.com/apache/tajo/pull/953 is resolved.

