Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/08/24 05:16:29 UTC

[GitHub] [druid] suneet-s opened a new pull request #10312: Optimize large InDimFilters

suneet-s opened a new pull request #10312:
URL: https://github.com/apache/druid/pull/10312


   ### Description
   
   For large InDimFilters in default mode, the filter scans the value set linearly
   to check whether it contains an empty string or a null. If it does, the empty
   strings are converted to nulls by passing through the entire set again.
   
   Instead, in default mode, we now attempt to remove the empty string from the
   values passed to the InDimFilter. If an empty string was removed, we add null
   to the set.
   
   <img width="1659" alt="Screen Shot 2020-08-23 at 10 12 30 PM" src="https://user-images.githubusercontent.com/44787917/91006291-d383c880-e58d-11ea-951b-38ec1bc92255.png">
   
   This flame graph shows that ~18% of query time was just spent checking if a null or empty string exists in the list of values to the InDimFilter. This happened on a join query where a filter was pushed down to the base table. The limit for filter push down was increased to a very large number so that a very large InDimFilter could be generated.
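
To make the before/after concrete, here is a minimal sketch of the two strategies described above. It is not the Druid source: the class name `InFilterValueSketch`, both method names, and the `sqlCompatible` boolean (standing in for `NullHandling.sqlCompatible()`) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-in for InDimFilter's value handling; not Druid code.
class InFilterValueSketch {
  // Old approach: one pass to look for null/empty values, then a second pass that copies the set.
  static Set<String> normalizeByScan(Set<String> values, boolean sqlCompatible) {
    if (sqlCompatible || values.stream().noneMatch(v -> v == null || v.isEmpty())) {
      return values;
    }
    return values.stream()
                 .map(v -> (v == null || v.isEmpty()) ? null : v)
                 .collect(Collectors.toSet());
  }

  // New approach: a single remove() call; note that it mutates the caller's set in place.
  static Set<String> normalizeInPlace(Set<String> values, boolean sqlCompatible) {
    if (!sqlCompatible && values.remove("")) {
      values.add(null);
    }
    return values;
  }

  public static void main(String[] args) {
    Set<String> a = new HashSet<>(Arrays.asList("x", "", "y"));
    Set<String> b = new HashSet<>(a);
    // Both strategies should produce the same set in default mode.
    System.out.println(normalizeByScan(a, false).equals(normalizeInPlace(b, false))); // prints true
  }
}
```

The in-place version trades the two linear passes for one hash operation, at the cost of modifying the set it was handed.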
   
   <hr>
   
   This PR has:
   - [ ] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/licenses.yaml)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s merged pull request #10312: Optimize large InDimFilters

Posted by GitBox <gi...@apache.org>.
suneet-s merged pull request #10312:
URL: https://github.com/apache/druid/pull/10312


   




[GitHub] [druid] abhishekagarwal87 commented on a change in pull request #10312: Optimize large InDimFilters

Posted by GitBox <gi...@apache.org>.
abhishekagarwal87 commented on a change in pull request #10312:
URL: https://github.com/apache/druid/pull/10312#discussion_r475361509



##########
File path: processing/src/main/java/org/apache/druid/query/filter/InDimFilter.java
##########
@@ -143,10 +142,11 @@ private InDimFilter(
 
     // The values set can be huge. Try to avoid copying the set if possible.
     // Note that we may still need to copy values to a list for caching. See getCacheKey().
-    if ((NullHandling.sqlCompatible() || values.stream().noneMatch(NullHandling::needsEmptyToNull))) {
+    if (NullHandling.sqlCompatible() || !values.remove("")) {
       this.values = values;
     } else {
-      this.values = values.stream().map(NullHandling::emptyToNullIfNeeded).collect(Collectors.toSet());
+      values.add(null);

Review comment:
       Most likely, `values` is a `HashSet`, which is internally backed by a `HashMap`, so `remove` and `contains` will be much faster than a linear scan.
   
   ```suggestion
       if (NullHandling.sqlCompatible()) {
         this.values = values;
       } else {
         // Default mode: replace an empty string, if present, with null.
         if (values.remove("")) {
           values.add(null);
         }
         this.values = values;
       }
   ```
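
As a small illustration of the point about `HashSet`, the sketch below relies on two documented `java.util` behaviors: `Set.remove` returns whether the element was present, and a set wrapped by `Collections.unmodifiableSet` rejects `remove`. The helper `swapEmptyForNull` and the class name are hypothetical, and nothing here says how InDimFilter itself treats unmodifiable inputs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration; not part of Druid.
class EmptyToNullSketch {
  // Returns true if "" was present and has been replaced by null.
  static boolean swapEmptyForNull(Set<String> values) {
    // For a HashSet, remove() is a hash lookup and reports whether the element was present.
    if (values.remove("")) {
      values.add(null); // HashSet permits a null element
      return true;
    }
    return false;
  }

  public static void main(String[] args) {
    Set<String> mutable = new HashSet<>(Arrays.asList("a", ""));
    System.out.println(swapEmptyForNull(mutable)); // first call swaps "" for null
    System.out.println(swapEmptyForNull(mutable)); // second call finds nothing to remove

    Set<String> readOnly = Collections.unmodifiableSet(new HashSet<>(Arrays.asList("a", "")));
    try {
      swapEmptyForNull(readOnly);
    } catch (UnsupportedOperationException e) {
      System.out.println("unmodifiable set rejects remove()");
    }
  }
}
```

So the single `remove` call covers both the check and the removal, but only when the set passed in is mutable.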






[GitHub] [druid] jihoonson commented on a change in pull request #10312: Optimize large InDimFilters

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #10312:
URL: https://github.com/apache/druid/pull/10312#discussion_r475359730



##########
File path: processing/src/main/java/org/apache/druid/query/filter/InDimFilter.java
##########
@@ -143,10 +142,11 @@ private InDimFilter(
 
     // The values set can be huge. Try to avoid copying the set if possible.
     // Note that we may still need to copy values to a list for caching. See getCacheKey().
-    if ((NullHandling.sqlCompatible() || values.stream().noneMatch(NullHandling::needsEmptyToNull))) {
+    if (NullHandling.sqlCompatible() || !values.remove("")) {
       this.values = values;
     } else {
-      this.values = values.stream().map(NullHandling::emptyToNullIfNeeded).collect(Collectors.toSet());
+      values.add(null);

Review comment:
       Oh sorry, you're right. I misread the code. It would be nice to add a comment explaining what we want to do here, since it's not very intuitive.






[GitHub] [druid] suneet-s commented on pull request #10312: Optimize large InDimFilters

Posted by GitBox <gi...@apache.org>.
suneet-s commented on pull request #10312:
URL: https://github.com/apache/druid/pull/10312#issuecomment-679407682


   @jihoonson Line coverage is failing for the SQL-compatibility tests. Because those tests run in SQL-compatible mode, we cannot get line coverage inside that if block. The processing module tests pass both branch and line coverage, so I think the existing tests cover all the branches.




[GitHub] [druid] suneet-s commented on pull request #10312: Optimize large InDimFilters

Posted by GitBox <gi...@apache.org>.
suneet-s commented on pull request #10312:
URL: https://github.com/apache/druid/pull/10312#issuecomment-679402952


   > The code change looks good to me. Can we add the test changes in https://github.com/apache/druid/pull/10312/files/a14f3807d34edd0a0e22c3e01b4fc69164d634e7..61fe33ebf762764bb89108ddd966937f3313be71#diff-a8ef4fb53d2e51cef400d9a903a4b3f7R61-R91 back to make coverage bot happy?
   
   Ah oops I thought this was covered by some other tests




[GitHub] [druid] jihoonson commented on a change in pull request #10312: Optimize large InDimFilters

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #10312:
URL: https://github.com/apache/druid/pull/10312#discussion_r475353053



##########
File path: processing/src/main/java/org/apache/druid/query/filter/InDimFilter.java
##########
@@ -143,10 +142,11 @@ private InDimFilter(
 
     // The values set can be huge. Try to avoid copying the set if possible.
     // Note that we may still need to copy values to a list for caching. See getCacheKey().
-    if ((NullHandling.sqlCompatible() || values.stream().noneMatch(NullHandling::needsEmptyToNull))) {
+    if (NullHandling.sqlCompatible() || !values.remove("")) {
       this.values = values;
     } else {
-      this.values = values.stream().map(NullHandling::emptyToNullIfNeeded).collect(Collectors.toSet());
+      values.add(null);

Review comment:
       Nice find, but this can lead to incorrect results in default mode, since `null` will always be added regardless of whether an empty string is in `values`.








[GitHub] [druid] suneet-s commented on a change in pull request #10312: Optimize large InDimFilters

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #10312:
URL: https://github.com/apache/druid/pull/10312#discussion_r475354734



##########
File path: processing/src/main/java/org/apache/druid/query/filter/InDimFilter.java
##########
@@ -143,10 +142,11 @@ private InDimFilter(
 
     // The values set can be huge. Try to avoid copying the set if possible.
     // Note that we may still need to copy values to a list for caching. See getCacheKey().
-    if ((NullHandling.sqlCompatible() || values.stream().noneMatch(NullHandling::needsEmptyToNull))) {
+    if (NullHandling.sqlCompatible() || !values.remove("")) {
       this.values = values;
     } else {
-      this.values = values.stream().map(NullHandling::emptyToNullIfNeeded).collect(Collectors.toSet());
+      values.add(null);

Review comment:
       I think the if condition on line 145 works such that it only enters this else block if there was already an empty string in the set. I should add unit tests here, because I had to think a lot about how the ordering of the if statements affects the code flow 😅 and how nulls and empty strings work in the different modes...
   
   What I want to happen here is:
   * If it's sqlCompatible mode, just use the values as is.
   * If it's default mode (i.e. not SQL-compatible mode), attempt to remove the empty string.
   * If an empty string was removed, add null.
   * If no empty string was removed, use the values as is.
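
The four cases above can be sketched as a small standalone model. This is only an illustration under stated assumptions: `InFilterCases`, `prepare`, and `setOf` are hypothetical names, and the `sqlCompatible` flag stands in for `NullHandling.sqlCompatible()`.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the intended behavior; the real code consults NullHandling.sqlCompatible().
class InFilterCases {
  static Set<String> prepare(Set<String> values, boolean sqlCompatible) {
    if (sqlCompatible) {
      return values;          // SQL-compatible mode: use the values as is
    }
    if (values.remove("")) {  // default mode: try to remove the empty string
      values.add(null);       // one was removed, so null takes its place
    }
    return values;            // nothing removed: the values are unchanged
  }

  static Set<String> setOf(String... items) {
    return new HashSet<>(Arrays.asList(items));
  }

  public static void main(String[] args) {
    System.out.println(prepare(setOf("a", ""), true).contains(""));    // true: left untouched
    System.out.println(prepare(setOf("a", ""), false).contains(null)); // true: "" became null
    System.out.println(prepare(setOf("a"), false).contains(null));     // false: no null added
  }
}
```

The last case is the one raised in review: without an empty string in the input, default mode never adds `null`.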






[GitHub] [druid] suneet-s commented on a change in pull request #10312: Optimize large InDimFilters

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #10312:
URL: https://github.com/apache/druid/pull/10312#discussion_r475785013



##########
File path: processing/src/main/java/org/apache/druid/query/filter/InDimFilter.java
##########
@@ -143,10 +142,11 @@ private InDimFilter(
 
     // The values set can be huge. Try to avoid copying the set if possible.
     // Note that we may still need to copy values to a list for caching. See getCacheKey().
-    if ((NullHandling.sqlCompatible() || values.stream().noneMatch(NullHandling::needsEmptyToNull))) {
+    if (NullHandling.sqlCompatible() || !values.remove("")) {
       this.values = values;
     } else {
-      this.values = values.stream().map(NullHandling::emptyToNullIfNeeded).collect(Collectors.toSet());
+      values.add(null);

Review comment:
       I've re-written the if statement to hopefully be easier to follow.



