Posted to issues@tajo.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/01/07 07:37:39 UTC
[jira] [Commented] (TAJO-2030) Use list S3 files using AmazonS3Client instead of using S3A
[ https://issues.apache.org/jira/browse/TAJO-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15086919#comment-15086919 ]
ASF GitHub Bot commented on TAJO-2030:
--------------------------------------
GitHub user blrunner opened a pull request:
https://github.com/apache/tajo/pull/932
TAJO-2030: Use list S3 files using AmazonS3Client instead of using S3A.
S3 bulk listing is fully implemented in ``TajoS3FileSystem``. My code is heavily based on ``PrestoS3FileSystem``. ``TajoS3FileSystem`` extends ``S3AFileSystem`` because ``PrestoS3FileSystem`` doesn't support some methods needed for writing files, for example ``FileSystem::mkdir``.
Here are my benchmark results.
# Configuration
* EC2 instance type : c3.xlarge
* Tajo version : 0.12.0-SNAPSHOT
* Cluster: 1 master, 1 worker
* Partitions: generated by Hive
# Queries
```
1 partition: select count(*) from lineitem where l_shipdate = '1992-01-02';
30 partitions: select count(*) from lineitem where l_shipdate > '1992-01-01' and l_shipdate < '1992-02-01';
90 partitions: select count(*) from lineitem where l_shipdate >= '1992-01-01' and l_shipdate < '1992-04-01';
151 partitions: select count(*) from lineitem where l_shipdate >= '1992-01-01' and l_shipdate < '1992-06-01';
334 partitions: select count(*) from lineitem where l_shipdate >= '1992-01-01' and l_shipdate < '1992-12-01';
```
# Results : Partition Pruning
Number of partitions | S3AFileSystem | TajoS3FileSystem | Improvement
-------------------|----------------------|--------------------------|-------------------
1 | 1088 ms | 607 ms | 1.79x
30 | 5421 ms | 3414 ms | 1.58x
90 | 15776 ms | 7927 ms | 1.99x
151 | 24060 ms | 14912 ms | 1.61x
334 | 45397 ms | 32247 ms | 1.40x
# Results : Query Finish Time
Number of partitions | S3AFileSystem | TajoS3FileSystem | Improvement
-------------------|----------------------|--------------------------|-------------------
1 | 3.99 sec | 2.726 sec | 1.46x
30 | 15.447 sec | 12.416 sec | 1.24x
90 | 40.153 sec | 31.593 sec | 1.27x
151 | 66.038 sec | 44.604 sec | 1.48x
334 | 137.137 sec | 90.419 sec | 1.51x
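The Improvement column in both tables appears to be the S3AFileSystem time divided by the TajoS3FileSystem time, truncated (not rounded) to two decimals. The sketch below reproduces that arithmetic under this assumption; the class and method names are hypothetical and not part of the pull request.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical helper, not part of the pull request.
class SpeedupSketch {
    // Ratio baseline / candidate, truncated to two decimals
    // (assumption: this is how the Improvement column was derived).
    static BigDecimal speedup(String baseline, String candidate) {
        return new BigDecimal(baseline)
                .divide(new BigDecimal(candidate), 2, RoundingMode.DOWN);
    }

    public static void main(String[] args) {
        // Partition pruning, 30 partitions: 5421 ms vs 3414 ms
        System.out.println(speedup("5421", "3414") + "x");
        // Query finish time, 334 partitions: 137.137 sec vs 90.419 sec
        System.out.println(speedup("137.137", "90.419") + "x");
    }
}
```

Because the ratio is unitless, the same helper works for the millisecond and second tables alike.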
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/blrunner/tajo TAJO-2030
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/tajo/pull/932.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #932
----
commit 1a90ad1688e763b7a1c52bdf7600c206094a10b6
Author: JaeHwa Jung <bl...@apache.org>
Date: 2016-01-07T05:10:07Z
TAJO-2030: Use list S3 files using AmazonS3Client instead of using S3A.
----
> Use list S3 files using AmazonS3Client instead of using S3A
> -----------------------------------------------------------
>
> Key: TAJO-2030
> URL: https://issues.apache.org/jira/browse/TAJO-2030
> Project: Tajo
> Issue Type: Sub-task
> Components: S3
> Reporter: Jaehwa Jung
> Assignee: Jaehwa Jung
> Fix For: 0.12.0
>
>
> AWS S3 provides a bulk listing API. It takes the common prefix of all input paths as a parameter and returns all objects whose keys start with that prefix, in blocks of 1000.
> Using AmazonS3Client to list S3 files instead of S3A should improve performance. To test this idea, I used PrestoFileSystem in place of S3AFileSystem. During partition pruning, PrestoFileSystem was much faster than S3AFileSystem.
> Here are my benchmark results for the following queries:
> {code}
> 1 partition : select count(*) from lineitem where l_shipdate = '1992-01-02';
> 30 partitions: select count(*) from lineitem where l_shipdate > '1992-01-01' and l_shipdate < '1992-02-01';
> 90 partitions: select count(*) from lineitem where l_shipdate >= '1992-01-01' and l_shipdate < '1992-04-01';
> 151 partitions: select count(*) from lineitem where l_shipdate >= '1992-01-01' and l_shipdate < '1992-06-01';
> {code}
> || (#) of partitions||PrestoFileSystem(ms)||S3AFileSystem(ms)||
> |1|677|800|
> |30|2753|6977|
> |90|6825|13772|
> |151|13834|25701|
> For reference, I used the TPC-H 1 GB dataset and set the {{l_shipdate}} column of the {{lineitem}} table as the partition column.
> I think there are two ways to resolve this:
> - Borrow PrestoFileSystem and related code from Presto
> - Implement the necessary code in S3TableSpace, using Presto as a reference
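The bulk listing idea in the quoted issue description can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the pull request: the bucket is simulated with an in-memory sorted set, `listPage` is a hypothetical stand-in for one paged list call on the S3 client, and the client-side filter that keeps only keys under the requested paths is an assumption about how the results would be narrowed down.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical sketch; names are illustrative, not taken from Tajo or Presto.
class BulkListingSketch {
    static final int PAGE_SIZE = 1000; // listing returns keys in blocks of 1000

    // Longest common prefix of all input paths, compared character by character.
    static String commonPrefix(List<String> paths) {
        if (paths.isEmpty()) return "";
        String prefix = paths.get(0);
        for (String p : paths) {
            int max = Math.min(prefix.length(), p.length());
            int i = 0;
            while (i < max && prefix.charAt(i) == p.charAt(i)) i++;
            prefix = prefix.substring(0, i);
        }
        return prefix;
    }

    // Stand-in for one list request: up to pageSize keys that start with
    // prefix and sort strictly after marker (null marker means first page).
    static List<String> listPage(NavigableSet<String> bucket, String prefix,
                                 String marker, int pageSize) {
        List<String> page = new ArrayList<>();
        NavigableSet<String> tail = (marker == null)
                ? bucket.tailSet(prefix, true)
                : bucket.tailSet(marker, false);
        for (String key : tail) {
            if (!key.startsWith(prefix)) break; // keys are sorted, so we are past the prefix
            page.add(key);
            if (page.size() == pageSize) break;
        }
        return page;
    }

    // Lists everything under the common prefix page by page, then keeps only
    // keys that fall under one of the requested paths.
    static List<String> listAll(NavigableSet<String> bucket, List<String> inputPaths) {
        String prefix = commonPrefix(inputPaths);
        List<String> matched = new ArrayList<>();
        String marker = null;
        while (true) {
            List<String> page = listPage(bucket, prefix, marker, PAGE_SIZE);
            for (String key : page) {
                for (String path : inputPaths) {
                    if (key.startsWith(path)) { matched.add(key); break; }
                }
            }
            if (page.size() < PAGE_SIZE) break; // short page: nothing left to fetch
            marker = page.get(page.size() - 1);
        }
        return matched;
    }

    public static void main(String[] args) {
        NavigableSet<String> bucket = new TreeSet<>();
        for (int day = 1; day <= 3; day++) {
            for (int f = 0; f < 800; f++) {
                bucket.add("lineitem/l_shipdate=1992-01-0" + day + "/part-" + f);
            }
        }
        List<String> paths = Arrays.asList(
                "lineitem/l_shipdate=1992-01-02/",
                "lineitem/l_shipdate=1992-01-03/");
        System.out.println("common prefix: " + commonPrefix(paths));
        System.out.println("matched keys:  " + listAll(bucket, paths).size());
    }
}
```

The point of the approach is that the number of list requests grows with the number of objects under the common prefix (one request per 1000 keys) rather than with the number of partition directories. The trade-off is that a short common prefix can pull in many keys that are then discarded by the filter.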
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)