Posted to dev@drill.apache.org by vvysotskyi <gi...@git.apache.org> on 2017/08/11 12:56:53 UTC

[GitHub] drill pull request #905: DRILL-1162: Fix OOM for hash join operator when the...

GitHub user vvysotskyi opened a pull request:

    https://github.com/apache/drill/pull/905

    DRILL-1162: Fix OOM for hash join operator when the right input has a much larger actual number of rows than the left one

    Please see the problem description in [DRILL-1162](https://issues.apache.org/jira/browse/DRILL-1162).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vvysotskyi/drill DRILL-1162

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/905.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #905
    
----
commit de9a32997774acf34d2cb42f534decac5fd75cb5
Author: Volodymyr Vysotskyi <vv...@gmail.com>
Date:   2017-08-09T21:21:48Z

    DRILL-1162: Fix OOM for hash join operator when the right input has a much larger actual number of rows than the left one

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill issue #905: DRILL-1162: Fix OOM for hash join operator when the right ...

Posted by vvysotskyi <gi...@git.apache.org>.
Github user vvysotskyi commented on the issue:

    https://github.com/apache/drill/pull/905
  
    @jinfengni, the idea is very interesting to me. A more accurate memory cost estimation will lead to choosing better plans, which in turn reduces both execution time and memory costs. 
    
    When I wrote about comparing the ratio of column counts vs the ratio of row counts, I meant that multiplying row count by column count does not help to choose which input needs more memory in the case when the ratio of the **actual** row counts is much greater than the ratio of the column counts. The `getCumulativeMemCost` implementation will not help in this case. 
    To show that, let's consider an example. We have three tables: 
    
    - a (5 columns and 5 rows, all with the same values); 
    - b (5 columns and 10 rows, with the same values as in table a); 
    - c (5 columns and 35 rows, with the same values as in table a). 
    
    For the query
    ```
    select count(*) from a
    inner join b on a.col1=b.col1
    inner join c on b.col1=c.col1;
    ```
    Drill will build the plan
    ```
        HashJoin[1]
         /    \
        c    HashJoin[2]
              /    \
             b      a
    ```
    `getCumulativeMemCost` for HashJoin[1] will return a value proportional to
    `(HashJoin[2] row count) * (HashJoin[2] column count) + HashJoin[2].getCumulativeMemCost() = Max(aRowCount, bRowCount) * (aColumnCount + bColumnCount) + aRowCount * aColumnCount = 5 * (5 + 5) + 5 * 5 = 75`.
    The actual row count for the build side of HashJoin[1] in this case is 50.
    For the plan that is more suitable for this particular case:
    ```
        HashJoin[1]
         /      \
    HashJoin[2]   c
     /    \
    b      a
    ```
    `getCumulativeMemCost` for HashJoin[1] will return a value proportional to `cRowCount * cColumnCount = 35 * 5 = 175`.
    The actual row count for the build side of HashJoin[1] in this case is 35.
    
    Some information about this pull request: it addresses only the case of a large row count. It checks whether an OOM may happen and, if swapping the inputs allows avoiding this potential OOM, performs the swap.
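    
    As a rough, self-contained illustration of that check (a hypothetical sketch only; the names and the threshold below are made up for this example and are not the PR's actual code):
    ```java
    // Hypothetical sketch of the swap decision described above; all
    // names and values here are illustrative, not the PR's implementation.
    class SwapSketch {
      static boolean shouldSwap(double buildSideMaxRows,
                                double probeSideMaxRows,
                                double rowsThatFitInMemory) {
        // Swap only when the current build side may overflow memory
        // while the other input would fit as the new build side.
        return buildSideMaxRows > rowsThatFitInMemory
            && probeSideMaxRows <= rowsThatFitInMemory;
      }
    
      public static void main(String[] args) {
        // From the example above: 50 build-side rows vs. 35 probe-side rows.
        System.out.println(shouldSwap(50, 35, 40)); // true: swapping avoids the potential OOM
      }
    }
    ```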


---

[GitHub] drill issue #905: DRILL-1162: Fix OOM for hash join operator when the right ...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/905
  
    @jinfengni, have you been able to take a look at this PR?


---

[GitHub] drill pull request #905: DRILL-1162: Fix OOM for hash join operator when the...

Posted by amansinha100 <gi...@git.apache.org>.
Github user amansinha100 commented on a diff in the pull request:

    https://github.com/apache/drill/pull/905#discussion_r140417433
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdMaxRowCount.java ---
    @@ -0,0 +1,71 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to you under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +* http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +package org.apache.drill.exec.planner.cost;
    +
    +import org.apache.calcite.plan.volcano.RelSubset;
    +import org.apache.calcite.rel.SingleRel;
    +import org.apache.calcite.rel.core.TableScan;
    +import org.apache.calcite.rel.metadata.ReflectiveRelMetadataProvider;
    +import org.apache.calcite.rel.metadata.RelMdMaxRowCount;
    +import org.apache.calcite.rel.metadata.RelMetadataProvider;
    +import org.apache.calcite.rel.metadata.RelMetadataQuery;
    +import org.apache.calcite.util.BuiltInMethod;
    +import org.apache.drill.exec.planner.physical.AbstractPrel;
    +import org.apache.drill.exec.planner.physical.ScanPrel;
    +
    +/**
    + * DrillRelMdMaxRowCount supplies a specific implementation of
    + * {@link RelMetadataQuery#getMaxRowCount} for Drill.
    + */
    +public class DrillRelMdMaxRowCount extends RelMdMaxRowCount {
    +
    +  private static final DrillRelMdMaxRowCount INSTANCE = new DrillRelMdMaxRowCount();
    +
    +  public static final RelMetadataProvider SOURCE = ReflectiveRelMetadataProvider.reflectiveSource(BuiltInMethod.MAX_ROW_COUNT.method, INSTANCE);
    +
    +  public Double getMaxRowCount(ScanPrel rel, RelMetadataQuery mq) {
    +    // the actual row count is known so returns its value
    +    return rel.estimateRowCount(mq);
    --- End diff --
    
    Returning the 'estimated' row count means that this is just an estimate, not the actual value, which could be higher. Looking at the implementation of `estimateRowCount()` for several of the storage/format plugins, there are several that use NO_EXACT_ROW_COUNT; for instance, see [1] for the text format plugin. So, I feel overloading getMaxRowCount() to return an estimate may cause problems. If you look at the semantics of getMaxRowCount in Calcite's RelMdMaxRowCount, it is only intended for cases where **_during planning time_** we can guarantee that the max row count will never exceed that value: for example, an Aggregate with no group-by clause, or a LIMIT.  
    
    
    [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/easy/EasyFormatPlugin.java#L186
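    
    For illustration of those semantics, a simplified, self-contained sketch (not Calcite's actual RelMdMaxRowCount code) of why a LIMIT yields a hard planning-time bound while a plain scan does not:
    ```java
    // Simplified sketch of the getMaxRowCount contract described above;
    // illustrative only, not Calcite's implementation.
    class MaxRowCountSketch {
      // LIMIT n guarantees at most n rows regardless of the data.
      static Double maxRowCountOfLimit(long fetch) {
        return (double) fetch;
      }
    
      // A plain table scan has no planning-time bound: the table may
      // hold any number of rows, so the honest answer is "unknown"
      // (null), not an estimate.
      static Double maxRowCountOfScan() {
        return null;
      }
    
      public static void main(String[] args) {
        System.out.println(maxRowCountOfLimit(10)); // 10.0
        System.out.println(maxRowCountOfScan());    // null
      }
    }
    ```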


---

[GitHub] drill issue #905: DRILL-1162: Fix OOM for hash join operator when the right ...

Posted by jinfengni <gi...@git.apache.org>.
Github user jinfengni commented on the issue:

    https://github.com/apache/drill/pull/905
  
    This OOM exposes two problems. The first is at planning time, where we choose a sub-optimal plan due to inaccurate row count estimation caused by missing statistics. The second is at execution time, where we may need to understand whether Drill uses too much memory and whether spilling to disk is an option. I think the two are complementary; even when we have spill-to-disk for hash join, if the planner chooses a sub-optimal plan, the query could still take a very long time to complete.
    Looks like the PR is addressing the first issue. I agree that the root cause is row count estimation, which is more appropriately deferred to the enhancement of statistics support. For the swap-join logic, the proposed getMaxRowCount() seems to be in the line of adjusting row count estimation. I like better the idea of combining row count + column count, which was essentially adopted in swapInputs() by LoptOptimizeJoinRule.
    For the hash table build side cost, the hash table only has to hold the join keys. However, since hash join is a blocking operator, it has to hold all the records in the build side, meaning the total memory requirement (for both the hash table and the non-join-key columns) depends on row count and column count. Therefore, the cost model of hash join should reflect that. Can we use a similar idea in SwapHashJoinVisitor?
    One further improvement would be to modify HashJoinPrel.computeSelfCost(). Today we only consider the join key width, and it makes sense to adjust that logic by considering the total column count in the build side. Such logic could be extracted into a common place, and then SwapHashJoinVisitor could call the same shared logic to decide whether it is cost-wise optimal to swap the input sides. Thoughts?
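    
    As a rough illustration of that adjustment, a self-contained sketch (names here are hypothetical, not Drill's actual computeSelfCost() logic):
    ```java
    // Hypothetical sketch of a build-side memory estimate that accounts
    // for all build-side columns, not just the join keys; illustrative
    // only, not Drill's actual cost model.
    class HashJoinMemCostSketch {
      // Memory for the hash table (join keys) plus the buffered
      // non-key columns of every build-side record.
      static double buildSideMemCost(double buildRowCount,
                                     int joinKeyWidthBytes,
                                     int nonKeyWidthBytes) {
        return buildRowCount * (joinKeyWidthBytes + nonKeyWidthBytes);
      }
    
      public static void main(String[] args) {
        // Keys-only vs. all-columns view of a 1M-row build side.
        System.out.println(buildSideMemCost(1_000_000, 8, 0));   // 8000000.0
        System.out.println(buildSideMemCost(1_000_000, 8, 120)); // 1.28E8
      }
    }
    ```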


---

[GitHub] drill pull request #905: DRILL-1162: Fix OOM for hash join operator when the...

Posted by vvysotskyi <gi...@git.apache.org>.
Github user vvysotskyi commented on a diff in the pull request:

    https://github.com/apache/drill/pull/905#discussion_r140730762
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdMaxRowCount.java ---
    @@ -0,0 +1,71 @@
    +/*
    +* Licensed to the Apache Software Foundation (ASF) under one or more
    +* contributor license agreements.  See the NOTICE file distributed with
    +* this work for additional information regarding copyright ownership.
    +* The ASF licenses this file to you under the Apache License, Version 2.0
    +* (the "License"); you may not use this file except in compliance with
    +* the License.  You may obtain a copy of the License at
    +*
    +* http://www.apache.org/licenses/LICENSE-2.0
    +*
    +* Unless required by applicable law or agreed to in writing, software
    +* distributed under the License is distributed on an "AS IS" BASIS,
    +* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +* See the License for the specific language governing permissions and
    +* limitations under the License.
    +*/
    +package org.apache.drill.exec.planner.cost;
    +
    +import org.apache.calcite.plan.volcano.RelSubset;
    +import org.apache.calcite.rel.SingleRel;
    +import org.apache.calcite.rel.core.TableScan;
    +import org.apache.calcite.rel.metadata.ReflectiveRelMetadataProvider;
    +import org.apache.calcite.rel.metadata.RelMdMaxRowCount;
    +import org.apache.calcite.rel.metadata.RelMetadataProvider;
    +import org.apache.calcite.rel.metadata.RelMetadataQuery;
    +import org.apache.calcite.util.BuiltInMethod;
    +import org.apache.drill.exec.planner.physical.AbstractPrel;
    +import org.apache.drill.exec.planner.physical.ScanPrel;
    +
    +/**
    + * DrillRelMdMaxRowCount supplies a specific implementation of
    + * {@link RelMetadataQuery#getMaxRowCount} for Drill.
    + */
    +public class DrillRelMdMaxRowCount extends RelMdMaxRowCount {
    +
    +  private static final DrillRelMdMaxRowCount INSTANCE = new DrillRelMdMaxRowCount();
    +
    +  public static final RelMetadataProvider SOURCE = ReflectiveRelMetadataProvider.reflectiveSource(BuiltInMethod.MAX_ROW_COUNT.method, INSTANCE);
    +
    +  public Double getMaxRowCount(ScanPrel rel, RelMetadataQuery mq) {
    +    // the actual row count is known so returns its value
    +    return rel.estimateRowCount(mq);
    --- End diff --
    
    Taking into account that `ScanPrel` may return an estimated value, I completely agree with you that this change would cause problems at planning time. 
    Thanks for pointing this out.


---

[GitHub] drill issue #905: DRILL-1162: Fix OOM for hash join operator when the right ...

Posted by jinfengni <gi...@git.apache.org>.
Github user jinfengni commented on the issue:

    https://github.com/apache/drill/pull/905
  
    @vvysotskyi , the example you listed (three tables a, b, c all with the same values) is essentially a cross join. For such cases, the current rowCount estimation is clearly way off from the real number, which would impact the estimation of hash join memory cost, and hence the proposed idea would not work. However, I feel it is not very common to have two tables joined like a cross join. The question is: does it make sense to modify the cost estimation for a seemingly uncommon use case?



---

[GitHub] drill pull request #905: DRILL-1162: Fix OOM for hash join operator when the...

Posted by vvysotskyi <gi...@git.apache.org>.
Github user vvysotskyi closed the pull request at:

    https://github.com/apache/drill/pull/905


---

[GitHub] drill issue #905: DRILL-1162: Fix OOM for hash join operator when the right ...

Posted by jinfengni <gi...@git.apache.org>.
Github user jinfengni commented on the issue:

    https://github.com/apache/drill/pull/905
  
    @vvysotskyi , what I'm thinking of is not just comparing the ratio of column counts vs the ratio of row counts.  
    
    Let's take a step back. This SwapHashJoinVisitor is trying to correct the join decision made by either VolcanoPlanner or HepPlanner, in the sense that the original join order might put the bigger dataset on the build side. The swap is needed because we want to avoid a high memory requirement on the hash join operator. The question is how we define a "big dataset". For now, SwapHashJoinVisitor simply uses rowCount, which is not sufficient. 
    
    Instead, we should probably add a method `getSelfMemCost` to all `Prel` nodes. For a non-blocking operator, it simply returns either 0 or some constant (to hold one single batch). For a blocking operator such as HashJoin, it returns a value proportional to rowCount x columnCount (more precisely, the total number of bytes per row, considering the different column data types). 
    
    Analogous to the existing `computeSelfCost` method, we also need `getCumulativeMemCost`, which returns the cumulative cost of the child nodes rooted at one `Prel` node. With `getSelfMemCost` and `getCumulativeMemCost` defined for HashJoin, and a HashJoin with input1, input2 as inputs, we could estimate the cumulative memory cost for HashJoin(input1, input2) and HashJoin(input2, input1), and use that as the criterion to decide whether we have to switch them; see the sketch below. 
    
    This idea is not trying to adjust the row count estimation. Instead, it is trying to change the criterion for when we think it is necessary to swap, based on the observation that we want to swap only when it reduces the memory requirement of a query. 
    
    Will the above idea work? If it cannot address this issue, it is probably fine to go with what you proposed. Before we go with that option, please give some thought to the above idea. 
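    
    A minimal, self-contained sketch of that proposal (the two method names come from this comment; everything else is hypothetical, not existing Drill code):
    ```java
    import java.util.Arrays;
    import java.util.List;
    
    // Hypothetical sketch of the proposed getSelfMemCost /
    // getCumulativeMemCost methods and the swap criterion;
    // illustrative only, not an existing Drill API.
    abstract class PrelSketch {
      abstract double rowCount();
      abstract int columnCount();
      abstract List<PrelSketch> inputs();
    
      // Non-blocking operators hold at most one batch, modeled here as 0.
      double getSelfMemCost() { return 0; }
    
      // Cumulative memory cost of the subtree rooted at this node.
      double getCumulativeMemCost() {
        double cost = getSelfMemCost();
        for (PrelSketch in : inputs()) {
          cost += in.getCumulativeMemCost();
        }
        return cost;
      }
    }
    
    class HashJoinSketch extends PrelSketch {
      final PrelSketch probe;  // left input
      final PrelSketch build;  // right input, buffered by the blocking operator
    
      HashJoinSketch(PrelSketch probe, PrelSketch build) {
        this.probe = probe;
        this.build = build;
      }
    
      // Row count estimate mirroring the Max(...) used in the example earlier in this thread.
      @Override double rowCount() { return Math.max(probe.rowCount(), build.rowCount()); }
      @Override int columnCount() { return probe.columnCount() + build.columnCount(); }
      @Override List<PrelSketch> inputs() { return Arrays.asList(probe, build); }
    
      // Blocking operator: must buffer the whole build side, so the self
      // cost is proportional to build-side rowCount x columnCount.
      @Override double getSelfMemCost() {
        return build.rowCount() * build.columnCount();
      }
    }
    ```
    The swap decision would then compare `new HashJoinSketch(in1, in2).getCumulativeMemCost()` against `new HashJoinSketch(in2, in1).getCumulativeMemCost()` and keep the cheaper input order.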
    



---

[GitHub] drill issue #905: DRILL-1162: Fix OOM for hash join operator when the right ...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/905
  
    @jinfengni, I think you are more familiar with this part of the code; can you take a look?


---

[GitHub] drill issue #905: DRILL-1162: Fix OOM for hash join operator when the right ...

Posted by vvysotskyi <gi...@git.apache.org>.
Github user vvysotskyi commented on the issue:

    https://github.com/apache/drill/pull/905
  
    @jinfengni thanks for looking into this. I completely agree with you that it would be better to consider both row count and column count. 
    Unfortunately, it does not help to fix this issue, since the ratio of the column counts of the tables is very often much smaller than the ratio of the (actual) row counts (`c1/c2 << r2/r1`).
    So I guess it makes sense to implement your proposal to consider the column count together with the row count in another pull request.


---

[GitHub] drill issue #905: DRILL-1162: Fix OOM for hash join operator when the right ...

Posted by paul-rogers <gi...@git.apache.org>.
Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/905
  
    @amansinha100, can you give this one a review? 


---

[GitHub] drill issue #905: DRILL-1162: Fix OOM for hash join operator when the right ...

Posted by priteshm <gi...@git.apache.org>.
Github user priteshm commented on the issue:

    https://github.com/apache/drill/pull/905
  
    @paul-rogers @jinfengni  can you please review this one?


---

[GitHub] drill issue #905: DRILL-1162: Fix OOM for hash join operator when the right ...

Posted by vvysotskyi <gi...@git.apache.org>.
Github user vvysotskyi commented on the issue:

    https://github.com/apache/drill/pull/905
  
    Since Drill does not have enough information about the tables at planning time to avoid this OOM, and since the current approach cannot be used without `getMaxRowCount()` for `ScanPrel`, it would be better to defer the fix for this issue until table statistics are implemented.


---

[GitHub] drill issue #905: DRILL-1162: Fix OOM for hash join operator when the right ...

Posted by vvysotskyi <gi...@git.apache.org>.
Github user vvysotskyi commented on the issue:

    https://github.com/apache/drill/pull/905
  
    @jinfengni it was an exaggerated example, but considering the case of multiple joins where the tables have many repeated values, the result will be the same. 


---