Posted to issues@tajo.apache.org by blrunner <gi...@git.apache.org> on 2016/02/01 03:11:35 UTC
[GitHub] tajo pull request: TAJO-2069: Remove getContentsSummary in TableSp...
GitHub user blrunner opened a pull request:
https://github.com/apache/tajo/pull/953
TAJO-2069: Remove getContentsSummary in TableSpace and Query.
Unit test cases are not yet implemented, and this change depends on TAJO-2063.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/blrunner/tajo TAJO-2069
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/tajo/pull/953.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #953
----
commit 145b8b209cee134ffdf548f2cf197cf97cf8c420
Author: JaeHwa Jung <bl...@apache.org>
Date: 2016-02-01T01:34:01Z
TAJO-2063: Refactor FileTablespace::commitOutputData.
commit 61a0c68aea40c5065b6ea17502761981ef0ebc91
Author: JaeHwa Jung <bl...@apache.org>
Date: 2016-02-01T01:46:19Z
Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into TAJO-2063
commit cdcbc54fc182c850c98341935c4bfc3933f4ba76
Author: JaeHwa Jung <bl...@apache.org>
Date: 2016-02-01T01:58:03Z
Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into TAJO-2063
commit 17b3204efe0b427ab2818e8802b8f8cc965ab02b
Author: JaeHwa Jung <bl...@apache.org>
Date: 2016-02-01T02:08:54Z
TAJO-2069: Remove getContentsSummary in TableSpace and Query.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...
Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:
https://github.com/apache/tajo/pull/953#issuecomment-179272659
Removed TAJO-2063 dependency.
---
[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...
Posted by blrunner <gi...@git.apache.org>.
Github user blrunner closed the pull request at:
https://github.com/apache/tajo/pull/953
---
[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...
Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:
https://github.com/apache/tajo/pull/953#issuecomment-179664173
@jihoonson
There may be various reasons: the local network connection, the health of Amazon's servers, and the AWS SDK retry mechanism.
---
[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...
Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:
https://github.com/apache/tajo/pull/953#issuecomment-182708569
@jinossy
I removed the hadoop-aws dependency and added the Amazon SDK dependency.
---
[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...
Posted by jinossy <gi...@git.apache.org>.
Github user jinossy commented on a diff in the pull request:
https://github.com/apache/tajo/pull/953#discussion_r51837592
--- Diff: tajo-storage/tajo-storage-s3/pom.xml ---
@@ -167,11 +168,44 @@
</exclusion>
</exclusions>
</dependency>
+
+ <dependency>
+ <groupId>org.apache.hadoop</groupId>
+ <artifactId>hadoop-aws</artifactId>
--- End diff --
hadoop-aws is included in Hadoop 2.6.0 and higher.
If you add hadoop-aws, we should discuss Hadoop compatibility.
---
[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...
Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:
https://github.com/apache/tajo/pull/953#issuecomment-180208144
@jihoonson
Thank you for your feedback. I'll test it again with reference to your comments.
@jinossy
That's a good point. I'll write an e-mail about Hadoop compatibility.
---
[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...
Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:
https://github.com/apache/tajo/pull/953#issuecomment-201121329
I finished testing successfully in the following environments:
* EC2 instances that were deployed manually.
* An EMR cluster created with the script below
```
aws emr create-cluster \
--name="<CLUSTER_NAME>" \
--release-label=emr-4.4.0 \
--no-auto-terminate \
--use-default-roles \
--ec2-attributes KeyName=<KEY_NAME> \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=c3.xlarge \
--bootstrap-action Name="Install tajo",Path=s3://jhjung-us/tajo-emr/install-tajo-java8.py,Args=["-t","s3://jhjung-us/tajo-emr/tajo-0.12.0-SNAPSHOT.tar.gz","-c","s3://tajo-emr/tajo-0.11.0/c3.xlarge/conf"]
```
* Direct access to S3 from OS X
---
[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...
Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on the pull request:
https://github.com/apache/tajo/pull/953#issuecomment-179447059
I wonder why the time taken by getTotalSize() is not proportional to the number of directories. Sometimes it is faster with more directories.
Do you know the reason?
---
[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...
Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on the pull request:
https://github.com/apache/tajo/pull/953#issuecomment-179682831
If those are the reasons, you can mitigate their effect by running the test several times and averaging the results.
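A minimal sketch of such a repeated measurement, assuming the operation under test can be wrapped in a Runnable. The class and method names here are hypothetical and are not part of Tajo; the warm-up count is also an assumption, added so that one-time costs such as connection setup do not skew the mean.

```java
// Hypothetical helper, not part of Tajo: times a task several times and averages.
final class AveragedTiming {

    /** Arithmetic mean of the given samples. */
    static double mean(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("need at least one sample");
        }
        double sum = 0;
        for (long s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    /**
     * Runs the task `warmups` times without timing it, then `runs` times
     * with timing, and returns the mean elapsed time in milliseconds.
     */
    static double averageMillis(Runnable task, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            task.run();
        }
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        return mean(nanos) / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Stand-in workload; a real benchmark would call the method under test here.
        double ms = averageMillis(() -> Math.sqrt(12345.678), 3, 10);
        System.out.printf("average: %.4f ms%n", ms);
    }
}
```

Averaging reduces the influence of a single slow request, but it does not remove systematic effects, so reporting the number of runs alongside the mean is useful.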
---
[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...
Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:
https://github.com/apache/tajo/pull/953#issuecomment-219904434
This PR has been moved to https://github.com/apache/tajo/pull/1024.
---
[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...
Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:
https://github.com/apache/tajo/pull/953#issuecomment-182739215
@jihoonson
Here are my second benchmark results.
Number of directories | S3AFileSystem | S3FileTableSpace | Improvement
----------------------|---------------|------------------|------------
5 | 1056.5 ms | 136.2 ms | 7.8x
365 | 56549 ms | 153.8 ms | 367.7x
730 | 113007.5 ms | 193.2 ms | 585x
1095 | 168567 ms | 215.7 ms | 781.5x
1460 | 228129.5 ms | 234.2 ms | 974.1x
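The Improvement column is the S3AFileSystem time divided by the S3FileTableSpace time. The short sketch below recomputes it from the values in the table and also derives the per-directory cost of each approach. The class and method names are hypothetical and exist only for this illustration.

```java
// Recomputes the ratios from the second benchmark table. Names are illustrative only.
final class BenchmarkRatios {
    // Values copied from the table above, in milliseconds.
    static final int[] DIRS = {5, 365, 730, 1095, 1460};
    static final double[] S3A_MS = {1056.5, 56549, 113007.5, 168567, 228129.5};
    static final double[] TABLESPACE_MS = {136.2, 153.8, 193.2, 215.7, 234.2};

    /** Speedup factor: how many times faster the new time is than the baseline. */
    static double improvement(double baselineMs, double newMs) {
        return baselineMs / newMs;
    }

    /** Average cost per directory for a given total time. */
    static double perDirectoryMs(double totalMs, int dirs) {
        return totalMs / dirs;
    }

    public static void main(String[] args) {
        for (int i = 0; i < DIRS.length; i++) {
            System.out.printf("%5d dirs: %.1fx, S3A %.1f ms/dir, tablespace %.2f ms/dir%n",
                    DIRS[i],
                    improvement(S3A_MS[i], TABLESPACE_MS[i]),
                    perDirectoryMs(S3A_MS[i], DIRS[i]),
                    perDirectoryMs(TABLESPACE_MS[i], DIRS[i]));
        }
    }
}
```

From 365 directories upward the S3AFileSystem time works out to about 154 to 156 ms per directory, which is consistent with a cost that grows linearly with the directory count, while the S3FileTableSpace time grows by only about 100 ms across the whole range.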
---
[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...
Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:
https://github.com/apache/tajo/pull/953#issuecomment-179264732
Here are my benchmark results.
# Configuration
* EC2 instance type: c3.xlarge
* Tajo version: 0.12.0-SNAPSHOT with TAJO-2030 and TAJO-2069
* Cluster: 1 master, 1 worker
# Contents summary time
Number of partitions | S3AFileSystem::getContentsSummary | S3FileTableSpace::getTotalSize | Improvement
---------------------|-----------------------------------|--------------------------------|------------
5 | 1372 ms | 17 ms | 80.7x
365 | 55447 ms | 120 ms | 462.0x
730 | 110245 ms | 101 ms | 1091.5x
1095 | 164812 ms | 222 ms | 742.4x
1460 | 221492 ms | 217 ms | 1020.7x
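The thread does not show how S3FileTableSpace::getTotalSize computes its result. As an illustration only, one common way to total sizes in an object store with flat keys is to scan every key under a prefix in a single ordered pass, instead of issuing a separate request for each partition directory. The sketch below models that idea against a hypothetical in-memory store. None of these names are Tajo, Hadoop, or AWS SDK APIs.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for an object store; not Tajo or AWS SDK code.
final class PrefixSizeSketch {
    // Sorted map of object key -> size in bytes, like a flat key namespace.
    private final TreeMap<String, Long> objects = new TreeMap<>();

    void put(String key, long sizeBytes) {
        objects.put(key, sizeBytes);
    }

    /** Sums the sizes of all objects whose key starts with the prefix, in one scan. */
    long totalSize(String prefix) {
        long total = 0;
        // Keys sharing a prefix are contiguous in sorted order, starting at the
        // first key that is greater than or equal to the prefix itself.
        for (Map.Entry<String, Long> e : objects.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // moved past the prefix range
            }
            total += e.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        PrefixSizeSketch store = new PrefixSizeSketch();
        store.put("table/p=1/part-0", 10);
        store.put("table/p=1/part-1", 20);
        store.put("table/p=2/part-0", 5);
        store.put("other/part-0", 7);
        System.out.println(store.totalSize("table/"));
    }
}
```

In this model the work depends on the number of keys under the prefix rather than on the number of directories, which is one possible explanation for timings that do not grow in proportion to the partition count.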
---