Posted to issues@tajo.apache.org by blrunner <gi...@git.apache.org> on 2016/02/01 03:11:35 UTC

[GitHub] tajo pull request: TAJO-2069: Remove getContentsSummary in TableSp...

GitHub user blrunner opened a pull request:

    https://github.com/apache/tajo/pull/953

    TAJO-2069: Remove getContentsSummary in TableSpace and Query.

    Unit test cases are not yet implemented, and this PR depends on TAJO-2063.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/blrunner/tajo TAJO-2069

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/tajo/pull/953.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #953
    
----
commit 145b8b209cee134ffdf548f2cf197cf97cf8c420
Author: JaeHwa Jung <bl...@apache.org>
Date:   2016-02-01T01:34:01Z

    TAJO-2063: Refactor FileTablespace::commitOutputData.

commit 61a0c68aea40c5065b6ea17502761981ef0ebc91
Author: JaeHwa Jung <bl...@apache.org>
Date:   2016-02-01T01:46:19Z

    Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into TAJO-2063

commit cdcbc54fc182c850c98341935c4bfc3933f4ba76
Author: JaeHwa Jung <bl...@apache.org>
Date:   2016-02-01T01:58:03Z

    Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/tajo into TAJO-2063

commit 17b3204efe0b427ab2818e8802b8f8cc965ab02b
Author: JaeHwa Jung <bl...@apache.org>
Date:   2016-02-01T02:08:54Z

    TAJO-2069: Remove getContentsSummary in TableSpace and Query.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...

Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:

    https://github.com/apache/tajo/pull/953#issuecomment-179272659
  
    Removed the TAJO-2063 dependency.



[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...

Posted by blrunner <gi...@git.apache.org>.
Github user blrunner closed the pull request at:

    https://github.com/apache/tajo/pull/953



[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...

Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:

    https://github.com/apache/tajo/pull/953#issuecomment-179664173
  
    @jihoonson 
    
    There may be various reasons: the local network connection, the health of Amazon's servers, and the AWS SDK retry mechanism.



[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...

Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:

    https://github.com/apache/tajo/pull/953#issuecomment-182708569
  
    @jinossy 
    
    I removed the hadoop-aws dependency and added the Amazon SDK dependency.



[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...

Posted by jinossy <gi...@git.apache.org>.
Github user jinossy commented on a diff in the pull request:

    https://github.com/apache/tajo/pull/953#discussion_r51837592
  
    --- Diff: tajo-storage/tajo-storage-s3/pom.xml ---
    @@ -167,11 +168,44 @@
             </exclusion>
           </exclusions>
         </dependency>
    +
    +    <dependency>
    +      <groupId>org.apache.hadoop</groupId>
    +      <artifactId>hadoop-aws</artifactId>
    --- End diff --
    
    hadoop-aws is included in Hadoop 2.6.0 and higher.
    If you add hadoop-aws, we should discuss Hadoop compatibility.



[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...

Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:

    https://github.com/apache/tajo/pull/953#issuecomment-180208144
  
    @jihoonson 
    Thank you for your feedback. I'll test it again with reference to your comments.
    
    @jinossy 
    That's a good point. I'll write an e-mail about Hadoop compatibility.



[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...

Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:

    https://github.com/apache/tajo/pull/953#issuecomment-201121329
  
    I finished testing successfully in the following environments:
    
    * EC2 instances which have been deployed manually.
    * EMR cluster created using the script below:
    ```
    aws emr create-cluster \
        --name="<CLUSTER_NAME>" \
        --release-label=emr-4.4.0 \
        --no-auto-terminate \
        --use-default-roles \
        --ec2-attributes KeyName=<KEY_NAME> \
        --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=c3.xlarge \
        --bootstrap-action Name="Install tajo",Path=s3://jhjung-us/tajo-emr/install-tajo-java8.py,Args=["-t","s3://jhjung-us/tajo-emr/tajo-0.12.0-SNAPSHOT.tar.gz","-c","s3://tajo-emr/tajo-0.11.0/c3.xlarge/conf"]
    ```
    * Direct access to S3 from OS X.



[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on the pull request:

    https://github.com/apache/tajo/pull/953#issuecomment-179447059
  
    I wonder why the time taken by getTotalSize() is not proportional to the number of directories. Sometimes it is even faster with more directories.
    Do you know the reason?



[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...

Posted by jihoonson <gi...@git.apache.org>.
Github user jihoonson commented on the pull request:

    https://github.com/apache/tajo/pull/953#issuecomment-179682831
  
    If those are the reasons, you can mitigate their effect by running the test several times and averaging the results.
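
Repeating a measurement and averaging it can be sketched as a small timing helper. This is only an illustration: `average_ms` and the placeholder workload are hypothetical names, not part of Tajo, and a real run would pass in the size computation being measured.

```python
import time
from statistics import mean, stdev


def average_ms(task, runs=10, warmup=2):
    """Run `task` several times and return (mean, stdev) of elapsed time in ms.

    `task` is any zero-argument callable. Warm-up runs are discarded so that
    one-off costs such as connection setup do not skew the mean.
    """
    for _ in range(warmup):
        task()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        samples.append((time.perf_counter() - start) * 1000.0)
    # stdev needs at least two samples; report 0.0 for a single run.
    spread = stdev(samples) if len(samples) > 1 else 0.0
    return mean(samples), spread


if __name__ == "__main__":
    # Placeholder workload (hypothetical); substitute the call under test.
    avg, spread = average_ms(lambda: sum(range(100_000)), runs=5)
    print(f"mean={avg:.2f} ms, stdev={spread:.2f} ms")
```

Reporting the standard deviation next to the mean makes it easier to tell whether a difference between two directory counts is larger than the run-to-run noise.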



[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...

Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:

    https://github.com/apache/tajo/pull/953#issuecomment-219904434
  
    This PR has been moved to https://github.com/apache/tajo/pull/1024.



[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...

Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:

    https://github.com/apache/tajo/pull/953#issuecomment-182739215
  
    @jihoonson 
    
    Here are my second benchmark results.
    
    Number of directories | S3AFileSystem | S3FileTableSpace | Improvement
    -------------------|----------------------|--------------------------|-------------------
    5 | 1056.5 ms | 136.2 ms | 7.8x
    365 | 56549 ms | 153.8 ms | 367.7x
    730 | 113007.5 ms | 193.2 ms | 585x
    1095 | 168567 ms | 215.7 ms | 781.5x
    1460 | 228129.5 ms | 234.2 ms | 974.1x
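
One way to read these figures against the proportionality question raised earlier in the thread is to divide each time by the directory count. The sketch below does only that arithmetic; the rows are copied from the table, and `SECOND_RUN` and `per_directory_ms` are illustrative names.

```python
# (directories, S3AFileSystem ms, S3FileTableSpace ms), copied from the table.
SECOND_RUN = [
    (5, 1056.5, 136.2),
    (365, 56549.0, 153.8),
    (730, 113007.5, 193.2),
    (1095, 168567.0, 215.7),
    (1460, 228129.5, 234.2),
]


def per_directory_ms(rows):
    """Return (directories, S3A ms per directory, table space ms per directory)."""
    return [(n, round(old / n, 1), round(new / n, 2)) for n, old, new in rows]


if __name__ == "__main__":
    for n, old, new in per_directory_ms(SECOND_RUN):
        print(f"{n:>5} dirs: S3AFileSystem {old} ms/dir, S3FileTableSpace {new} ms/dir")
```

From 365 directories upward the S3AFileSystem cost stays between about 153.9 and 156.3 ms per directory, which is what proportional scaling looks like, whereas the S3FileTableSpace cost per directory keeps falling as the directory count grows.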



[GitHub] tajo pull request: TAJO-2069: Implement finding the total size of ...

Posted by blrunner <gi...@git.apache.org>.
Github user blrunner commented on the pull request:

    https://github.com/apache/tajo/pull/953#issuecomment-179264732
  
    Here are my benchmark results.
    
    # Configuration
    
    * EC2 instance type : c3.xlarge
    * Tajo version : 0.12.0-SNAPSHOT with TAJO-2030 and TAJO-2069
    * Cluster: 1 master, 1 worker
    
    # Contents summary time
    
    Number of partitions | S3AFileSystem::getContentsSummary | S3FileTableSpace::getTotalSize | Improvement
    -------------------|----------------------|--------------------------|-------------------
    5 | 1372 ms | 17 ms | 80.7x
    365 | 55447 ms | 120 ms | 462.0x
    730 | 110245 ms | 101 ms | 1091.5x
    1095 | 164812 ms | 222 ms | 742.4x
    1460 | 221492 ms | 217 ms | 1020.7x
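
The Improvement column appears to be the baseline time divided by the new time. A minimal sketch of that calculation follows, using the rows above; `FIRST_RUN` and `improvement` are illustrative names, and rounding to one decimal place is an assumption about how the column was produced.

```python
# (partitions, S3AFileSystem ms, S3FileTableSpace ms), copied from the table.
FIRST_RUN = [
    (5, 1372, 17),
    (365, 55447, 120),
    (730, 110245, 101),
    (1095, 164812, 222),
    (1460, 221492, 217),
]


def improvement(baseline_ms, new_ms):
    """Speed-up factor: how many times faster the new call is than the baseline."""
    return round(baseline_ms / new_ms, 1)


if __name__ == "__main__":
    for partitions, baseline_ms, new_ms in FIRST_RUN:
        print(f"{partitions:>5} partitions: {improvement(baseline_ms, new_ms)}x")
```

Recomputed this way, four of the five rows match the table exactly; the 365-partition row comes out as 462.1x rather than the reported 462.0x, a difference that only reflects rounding.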

