You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/08/24 15:31:52 UTC

[GitHub] [druid] suneet-s opened a new pull request #10316: Optimize InDimFilter hashCode calculation

suneet-s opened a new pull request #10316:
URL: https://github.com/apache/druid/pull/10316


   ### Description
   
   InDimFilter can operate on a large set of values. Computing the hashCode for
   this large set of values can be expensive. Instead of this, Druid can use the
   number of values in the filter to compute the hashCode. This should speed up
   the computation with the side-effect of higher collisions.
   
   The equals method will still check every value in the list, so 2 filters
   operating on the same dimension with the same filter shape and values, will
   not be considered equal.
   
   <img width="1672" alt="Screen Shot 2020-08-24 at 8 29 54 AM" src="https://user-images.githubusercontent.com/44787917/91064397-0b1d5f80-e5e4-11ea-85c1-2c312db579a4.png">
   
   This flamegraph shows a query that spends ~10% of it's time calculating the hashCode for the InDimFilter which has a large number of values
   
   <hr>
   
   This PR has:
   - [ ] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s closed pull request #10316: Memoize InDimFilter hashCode calculation

Posted by GitBox <gi...@apache.org>.
suneet-s closed pull request #10316:
URL: https://github.com/apache/druid/pull/10316


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on pull request #10316: Memoize InDimFilter hashCode calculation

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on pull request #10316:
URL: https://github.com/apache/druid/pull/10316#issuecomment-681631844


   there are two other places where this InDimFilter is being created and there too an ImmutableSet can be used. As Gian pointed out, the ImmutableSet caches the hashCode while it's building the set. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on pull request #10316: Memoize InDimFilter hashCode calculation

Posted by GitBox <gi...@apache.org>.
suneet-s commented on pull request #10316:
URL: https://github.com/apache/druid/pull/10316#issuecomment-682019819


   > > ```
   > > diff --git a/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java b/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
   > > index 67b1ee0b76..7da31f3811 100644
   > > --- a/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
   > > +++ b/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
   > > @@ -435,6 +435,11 @@ public class JoinFilterAnalyzer
   > >            );
   > >          }
   > >  
   > > +        // Wrap filter values in immutable set so that classes consuming this set do not compute the hash code
   > > +        // again and again. We are creating multiple InDimFilter filters down for the same set of filter values.
   > > +        // Calculating hashCode will be less expensive for these InDimFilter as hashCode for filter values is pre-computed
   > > +        newFilterValues = ImmutableSet.copyOf(newFilterValues);
   > > +
   > >          for (String correlatedBaseColumn : correlationAnalysis.getBaseColumns()) {
   > >            Filter rewrittenFilter = new InDimFilter(
   > >                correlatedBaseColumn,
   > > ```
   > > 
   > > 
   > > This is what I had in mind.
   > 
   > I initially didn't want to use an ImmutableSet because `ImmutableSet.copyOf(...)` would have to traverse through the entire set, and there may be cases where an InDimFilter doesn't need to traverse the entire set. However, this comment made me think that maybe the best way to do this is to have the correlatedValuesMap use an `ImmutableSet.Builder` instead. This way the hashCode is memoized as the set is being constructed! I might leave the other code paths as is so that this change only really impacts join clauses. What do you think @abhishekagarwal87 ?
   
   I changed my mind... decided it's better to spend a little more time thinking about this
   
   IndexedTableJoinable needs to know the number of uniques while constructing the Set so that we can limit the number of values that can be pushed down. The ImmutableSetBuilder doesn't know the number of uniques till the time the Set is constructed
   ```
   IntList rowIndex = index.find(searchColumnValue);
           for (int i = 0; i < rowIndex.size(); i++) {
             int rowNum = rowIndex.getInt(i);
             String correlatedDimVal = Objects.toString(reader.read(rowNum), null);
             correlatedValuesBuilder.add(correlatedDimVal);
   
             if (correlatedValuesBuilder.size() > maxCorrelationSetSize) {
               return Optional.empty();
             }
           }
           return Optional.of(correlatedValuesBuilder.build());
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #10316: Optimize InDimFilter hashCode calculation

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #10316:
URL: https://github.com/apache/druid/pull/10316#discussion_r476197783



##########
File path: processing/src/main/java/org/apache/druid/query/filter/InDimFilter.java
##########
@@ -382,7 +386,7 @@ public boolean equals(Object o)
   @Override
   public int hashCode()
   {
-    return Objects.hash(values, dimension, extractionFn, filterTuning);
+    return Objects.hash(values.size(), dimension, extractionFn, filterTuning);

Review comment:
       Maybe use the size and the first few values?
   
   It's easy to imagine situations where the extra collisions from only checking size are a problem, and it's tough to imagine situations where the perf impact of adding[the first few values is going to be big. So it seems like a good idea.
   
   Please also include a comment about the rationale for the nonstandard hashCode impl. It'd be good to link to this PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on pull request #10316: Memoize InDimFilter hashCode calculation

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on pull request #10316:
URL: https://github.com/apache/druid/pull/10316#issuecomment-681627518


   ```
   diff --git a/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java b/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
   index 67b1ee0b76..7da31f3811 100644
   --- a/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
   +++ b/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
   @@ -435,6 +435,11 @@ public class JoinFilterAnalyzer
              );
            }
    
   +        // Wrap filter values in immutable set so that classes consuming this set do not compute the hash code
   +        // again and again. We are creating multiple InDimFilter filters down for the same set of filter values.
   +        // Calculating hashCode will be less expensive for these InDimFilter as hashCode for filter values is pre-computed
   +        newFilterValues = ImmutableSet.copyOf(newFilterValues);
   +
            for (String correlatedBaseColumn : correlationAnalysis.getBaseColumns()) {
              Filter rewrittenFilter = new InDimFilter(
                  correlatedBaseColumn,
   
   ```
   
   This is what I had in mind. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on pull request #10316: Memoize InDimFilter hashCode calculation

Posted by GitBox <gi...@apache.org>.
suneet-s commented on pull request #10316:
URL: https://github.com/apache/druid/pull/10316#issuecomment-683212615


   I'm not happy with this approach. Going to think about this for a little more time and I'll re-open when I think of a better approach.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #10316: Optimize InDimFilter hashCode calculation

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #10316:
URL: https://github.com/apache/druid/pull/10316#discussion_r476750999



##########
File path: processing/src/main/java/org/apache/druid/query/filter/InDimFilter.java
##########
@@ -382,7 +386,7 @@ public boolean equals(Object o)
   @Override
   public int hashCode()
   {
-    return Objects.hash(values, dimension, extractionFn, filterTuning);
+    return Objects.hash(values.size(), dimension, extractionFn, filterTuning);

Review comment:
       ImmutableSet computes its hashcode as it is built and then caches it.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #10316: Optimize InDimFilter hashCode calculation

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #10316:
URL: https://github.com/apache/druid/pull/10316#discussion_r476197783



##########
File path: processing/src/main/java/org/apache/druid/query/filter/InDimFilter.java
##########
@@ -382,7 +386,7 @@ public boolean equals(Object o)
   @Override
   public int hashCode()
   {
-    return Objects.hash(values, dimension, extractionFn, filterTuning);
+    return Objects.hash(values.size(), dimension, extractionFn, filterTuning);

Review comment:
       Maybe use the size and the first few values?
   
   It's easy to imagine situations where the extra collisions from only checking size are a problem, and it's tough to imagine situations where the perf impact of adding the first few values is going to be big. So it seems like a good idea.
   
   Please also include a comment about the rationale for the nonstandard hashCode impl. It'd be good to link to this PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on a change in pull request #10316: Optimize InDimFilter hashCode calculation

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on a change in pull request #10316:
URL: https://github.com/apache/druid/pull/10316#discussion_r476398225



##########
File path: processing/src/main/java/org/apache/druid/query/filter/InDimFilter.java
##########
@@ -382,7 +386,7 @@ public boolean equals(Object o)
   @Override
   public int hashCode()
   {
-    return Objects.hash(values, dimension, extractionFn, filterTuning);
+    return Objects.hash(values.size(), dimension, extractionFn, filterTuning);

Review comment:
       It's interesting though that `values` itself is a HashSet being passed to InDimFilter which would mean hash code is evaluated for all the elements in the set. But that penalty for constructing `values` doesn't show up in the graph. is the full flame graph available to look further? 
   I can see in one place where multiple `InDimFilter` are created with the same `values`.  Maybe that's the part responsible for perf penalty. If there is a `Set` type that remembers its hashCode, using such type for `values` could be more beneficial. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on pull request #10316: Memoize InDimFilter hashCode calculation

Posted by GitBox <gi...@apache.org>.
suneet-s commented on pull request #10316:
URL: https://github.com/apache/druid/pull/10316#issuecomment-682000989


   > ```
   > diff --git a/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java b/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
   > index 67b1ee0b76..7da31f3811 100644
   > --- a/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
   > +++ b/processing/src/main/java/org/apache/druid/segment/join/filter/JoinFilterAnalyzer.java
   > @@ -435,6 +435,11 @@ public class JoinFilterAnalyzer
   >            );
   >          }
   >  
   > +        // Wrap filter values in immutable set so that classes consuming this set do not compute the hash code
   > +        // again and again. We are creating multiple InDimFilter filters down for the same set of filter values.
   > +        // Calculating hashCode will be less expensive for these InDimFilter as hashCode for filter values is pre-computed
   > +        newFilterValues = ImmutableSet.copyOf(newFilterValues);
   > +
   >          for (String correlatedBaseColumn : correlationAnalysis.getBaseColumns()) {
   >            Filter rewrittenFilter = new InDimFilter(
   >                correlatedBaseColumn,
   > ```
   > 
   > This is what I had in mind.
   
   I initially didn't want to use an ImmutableSet because `ImmutableSet.copyOf(...)` would have to traverse through the entire set, and there may be cases where an InDimFilter doesn't need to traverse the entire set. However, this comment made me think that maybe the best way to do this is to have the correlatedValuesMap use an `ImmutableSet.Builder` instead. This way the hashCode is memoized as the set is being constructed! I might leave the other code paths as is so that this change only really impacts join clauses. What do you think @abhishekagarwal87 ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #10316: Optimize InDimFilter hashCode calculation

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #10316:
URL: https://github.com/apache/druid/pull/10316#discussion_r476608590



##########
File path: processing/src/main/java/org/apache/druid/query/filter/InDimFilter.java
##########
@@ -382,7 +386,7 @@ public boolean equals(Object o)
   @Override
   public int hashCode()
   {
-    return Objects.hash(values, dimension, extractionFn, filterTuning);
+    return Objects.hash(values.size(), dimension, extractionFn, filterTuning);

Review comment:
       @abhishekagarwal87 Good point... I'm not sure why the construction time doesn't show up. I'll check if there is a set that memoizes it's hashcode as the set is being constructed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org