You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@drill.apache.org by kkhatua <gi...@git.apache.org> on 2018/02/28 22:04:25 UTC

[GitHub] drill pull request #1141: DRILL-6197: Skip duplicate entry for OperatorStats

GitHub user kkhatua opened a pull request:

    https://github.com/apache/drill/pull/1141

    DRILL-6197: Skip duplicate entry for OperatorStats

    `org.apache.drill.exec.ops.FragmentStats` should skip injecting the `org.apache.drill.exec.ops.OperatorStats` instance for these operators:
    ```
    org.apache.drill.exec.proto.beans.CoreOperatorType.SCREEN
    org.apache.drill.exec.proto.beans.CoreOperatorType.SINGLE_SENDER
    org.apache.drill.exec.proto.beans.CoreOperatorType.BROADCAST_SENDER
    org.apache.drill.exec.proto.beans.CoreOperatorType.HASH_PARTITION_SENDER
    ```
    They all use the `org.apache.drill.exec.physical.impl.BaseRootExec` to inject the correct statistics.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/kkhatua/drill DRILL-6197

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1141.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1141
    
----
commit f61e0416b10ebf9826540bb0bbe7de5d826de029
Author: Kunal Khatua <kk...@...>
Date:   2018-02-28T21:58:09Z

    DRILL-6197: Skip duplicate entry for OperatorStats
    
    org.apache.drill.exec.ops.FragmentStats should skip injecting the org.apache.drill.exec.ops.OperatorStats instance for these operators:
    org.apache.drill.exec.proto.beans.CoreOperatorType.SCREEN
    org.apache.drill.exec.proto.beans.CoreOperatorType.SINGLE_SENDER
    org.apache.drill.exec.proto.beans.CoreOperatorType.BROADCAST_SENDER
    org.apache.drill.exec.proto.beans.CoreOperatorType.HASH_PARTITION_SENDER

----


---

[GitHub] drill pull request #1141: DRILL-6197: Skip duplicate entry for OperatorStats

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1141#discussion_r171431227
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentStats.java ---
    @@ -31,6 +32,13 @@
     public class FragmentStats {
     //  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(FragmentStats.class);
     
    +  //Skip operators that already have stats reported by org.apache.drill.exec.physical.impl.BaseRootExec
    +  private static final List<Integer> operatorStatsInitToSkip = Lists.newArrayList(
    --- End diff --
    
    This could get out of sync with the types of senders that extend BaseRootExec. 


---

[GitHub] drill pull request #1141: DRILL-6197: Skip duplicate entry for OperatorStats

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1141#discussion_r171723902
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentStats.java ---
    @@ -79,4 +71,21 @@ public void addOperatorStats(OperatorStats stats) {
         operators.add(stats);
       }
     
    +  //DRILL-6197
    +  public OperatorStats addOrReplaceOperatorStats(OperatorStats stats) {
    +    //Remove existing stat
    +    OperatorStats replacedStat = null;
    +    int index = 0;
    +    for (OperatorStats opStat : operators) {
    --- End diff --
    
    Some TPC-DS queries have fairly long list of operators within a fragment and in general it would be preferable to not do this search.  Can you point to where this Json serialization happens ?  my guess is it just needs to preserve the insertion order.  In that case we could use a LinkedHashSet which would provide both the duplicate removal and keep insertion order. 


---

[GitHub] drill pull request #1141: DRILL-6197: Skip duplicate entry for OperatorStats

Posted by kkhatua <gi...@git.apache.org>.
Github user kkhatua commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1141#discussion_r171767495
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentStats.java ---
    @@ -79,4 +71,21 @@ public void addOperatorStats(OperatorStats stats) {
         operators.add(stats);
       }
     
    +  //DRILL-6197
    +  public OperatorStats addOrReplaceOperatorStats(OperatorStats stats) {
    +    //Remove existing stat
    +    OperatorStats replacedStat = null;
    +    int index = 0;
    +    for (OperatorStats opStat : operators) {
    --- End diff --
    
    Everything worked fine. Tried doing a join for TPCH tables - `lineitem` and `orders`, and confirmed no more duplicates for SCREEN, SINGLE_SENDER and HASH_PARTITION_SENDER. For a smaller substitute of `orders` with `supplier` ; confirmed that the BROADCAST_SENDER was also not having duplicates..


---

[GitHub] drill issue #1141: DRILL-6197: Skip duplicate entry for OperatorStats

Posted by kkhatua <gi...@git.apache.org>.
Github user kkhatua commented on the issue:

    https://github.com/apache/drill/pull/1141
  
    @amansinha100  please review


---

[GitHub] drill pull request #1141: DRILL-6197: Skip duplicate entry for OperatorStats

Posted by kkhatua <gi...@git.apache.org>.
Github user kkhatua commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1141#discussion_r171648058
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentStats.java ---
    @@ -79,4 +71,21 @@ public void addOperatorStats(OperatorStats stats) {
         operators.add(stats);
       }
     
    +  //DRILL-6197
    +  public OperatorStats addOrReplaceOperatorStats(OperatorStats stats) {
    +    //Remove existing stat
    +    OperatorStats replacedStat = null;
    +    int index = 0;
    +    for (OperatorStats opStat : operators) {
    --- End diff --
    
    The choice for using a list for the collection of stats seems to be because it simply gets serialized into a JSON list. . As for the overhead, since each list is specific to a minor fragment (which typically has about 3-8 operators), the overhead of doing a linear search is not significant and is invoked only for specific operators. That is one of the reasons why I didn't modify the original `addOperatorStats()` implementation with that of `addOrReplaceOperatorStats()`.


---

[GitHub] drill pull request #1141: DRILL-6197: Skip duplicate entry for OperatorStats

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1141#discussion_r171753536
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentStats.java ---
    @@ -79,4 +71,21 @@ public void addOperatorStats(OperatorStats stats) {
         operators.add(stats);
       }
     
    +  //DRILL-6197
    +  public OperatorStats addOrReplaceOperatorStats(OperatorStats stats) {
    +    //Remove existing stat
    +    OperatorStats replacedStat = null;
    +    int index = 0;
    +    for (OperatorStats opStat : operators) {
    --- End diff --
    
    LGTM.  Hopefully it did not break existing stuff..so will wait for your confirmation.   


---

[GitHub] drill pull request #1141: DRILL-6197: Skip duplicate entry for OperatorStats

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1141#discussion_r171611927
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentStats.java ---
    @@ -79,4 +71,21 @@ public void addOperatorStats(OperatorStats stats) {
         operators.add(stats);
       }
     
    +  //DRILL-6197
    +  public OperatorStats addOrReplaceOperatorStats(OperatorStats stats) {
    +    //Remove existing stat
    +    OperatorStats replacedStat = null;
    +    int index = 0;
    +    for (OperatorStats opStat : operators) {
    --- End diff --
    
    I am worried about the small overheads of doing this linear search adding up for each operator, especially for queries with complex query plans.  The stats collection should ideally impose minimal overhead.  Does the operator stats have to be a list or can just use a Set ? 


---

[GitHub] drill pull request #1141: DRILL-6197: Skip duplicate entry for OperatorStats

Posted by kkhatua <gi...@git.apache.org>.
Github user kkhatua commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1141#discussion_r171740245
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentStats.java ---
    @@ -79,4 +71,21 @@ public void addOperatorStats(OperatorStats stats) {
         operators.add(stats);
       }
     
    +  //DRILL-6197
    +  public OperatorStats addOrReplaceOperatorStats(OperatorStats stats) {
    +    //Remove existing stat
    +    OperatorStats replacedStat = null;
    +    int index = 0;
    +    for (OperatorStats opStat : operators) {
    --- End diff --
    
    I see your point. Also, digging into the code shows I can substitute with a LinkedHashMap, since the list is only referenced here for consumption of its contents:
    https://github.com/kkhatua/drill/blob/65efe3ea0c5777490488d3d56cbdb0cb011b9f33/exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentStats.java#L45
    
    I can't use a Set, because I need the Stats object hashed on the operator ID & Type, and not the rest of the contents. I'll refactor and try to confirm nothing else breaks.


---

[GitHub] drill pull request #1141: DRILL-6197: Skip duplicate entry for OperatorStats

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/drill/pull/1141


---

[GitHub] drill pull request #1141: DRILL-6197: Skip duplicate entry for OperatorStats

Posted by kkhatua <gi...@git.apache.org>.
Github user kkhatua commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1141#discussion_r171485412
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentStats.java ---
    @@ -31,6 +32,13 @@
     public class FragmentStats {
     //  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(FragmentStats.class);
     
    +  //Skip operators that already have stats reported by org.apache.drill.exec.physical.impl.BaseRootExec
    +  private static final List<Integer> operatorStatsInitToSkip = Lists.newArrayList(
    --- End diff --
    
    The add-on commit refactors by having the BaseRootExec constructor handle the substition without risking going out of sync with other senders extending BaseRootExec.


---

[GitHub] drill pull request #1141: DRILL-6197: Skip duplicate entry for OperatorStats

Posted by kkhatua <gi...@git.apache.org>.
Github user kkhatua commented on a diff in the pull request:

    https://github.com/apache/drill/pull/1141#discussion_r171748860
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentStats.java ---
    @@ -79,4 +71,21 @@ public void addOperatorStats(OperatorStats stats) {
         operators.add(stats);
       }
     
    +  //DRILL-6197
    +  public OperatorStats addOrReplaceOperatorStats(OperatorStats stats) {
    +    //Remove existing stat
    +    OperatorStats replacedStat = null;
    +    int index = 0;
    +    for (OperatorStats opStat : operators) {
    --- End diff --
    
    I added a new commit, but I haven't tested it for performance. Can you take a look, @amansinha100 ?


---

[GitHub] drill issue #1141: DRILL-6197: Skip duplicate entry for OperatorStats

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on the issue:

    https://github.com/apache/drill/pull/1141
  
    +1.  


---