You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/02/06 21:17:59 UTC

[GitHub] [druid] suneet-s commented on a change in pull request #12195: Moving in filter check to broker

suneet-s commented on a change in pull request #12195:
URL: https://github.com/apache/druid/pull/12195#discussion_r800227896



##########
File path: docs/querying/query-context.md
##########
@@ -62,7 +62,7 @@ Unless otherwise noted, the following parameters apply to all query types.
 |secondaryPartitionPruning|`true`|Enable secondary partition pruning on the Broker. The Broker will always prune unnecessary segments from the input scan based on a filter on time intervals, but if the data is further partitioned with hash or range partitioning, this option will enable additional pruning based on a filter on secondary partition dimensions.|
 |enableJoinLeftTableScanDirect|`false`|This flag applies to queries which have joins. For joins, where left child is a simple scan with a filter,  by default, druid will run the scan as a query and the join the results to the right child on broker. Setting this flag to true overrides that behavior and druid will attempt to push the join to data servers instead. Please note that the flag could be applicable to queries even if there is no explicit join. since queries can internally translated into a join by the SQL planner.|
 |debug| `false` | Flag indicating whether to enable debugging outputs for the query. When set to false, no additional logs will be produced (logs produced will be entirely dependent on your logging level). When set to true, the following addition logs will be produced:<br />- Log the stack trace of the exception (if any) produced by the query |
-
+|maxNumericInFilters|`-1`|Max limit for the amount of numeric values that can be compared for a string type dimension when the entire SQL WHERE clause of a query translates to an [IN filter](../querying/filters.md#in-filter). By default, Druid does not restrict the amount of numbers in an IN filter, although this situation may block other queries from running. Set this property to a smaller value to prevent Druid from running queries that have prohibitively long segment processing times. The optimal limit requires some trial and error; we recommend starting with 100.  Users who submit a query that exceeds the limit of `maxNumericInFilters` should instead rewrite their queries to use strings in the IN filter instead of numbers. For example, `WHERE someString IN (‘123’, ‘456’)`. This value cannot exceed the set system configuration `druid.sql.planner.maxNumericInFilters`. This value is ignored if `druid.sql.planner.maxNumericInFilters` is not set explicitly.|

Review comment:
       Docs should include a note of how this field interacts with the system property `maxNumericInFilters`

##########
File path: sql/src/main/java/org/apache/druid/sql/calcite/planner/PlannerConfig.java
##########
@@ -190,6 +210,46 @@ public PlannerConfig withOverrides(final Map<String, Object> context)
     return newConfig;
   }
 
+  private int validateMaxNumericInFilters(int queryContextMaxNumericInFilters, int systemConfigMaxNumericInFilters)
+  {
+    // if maxNumericInFIlters through context == 0 catch exception
+    // else if query context exceeds system set value throw error
+    if (queryContextMaxNumericInFilters == 0) {
+      throw new UOE("[%s] must be greater than 0", CTX_MAX_NUMERIC_IN_FILTERS);
+    } else if (queryContextMaxNumericInFilters > systemConfigMaxNumericInFilters
+               && systemConfigMaxNumericInFilters != NUM_FILTER_NOT_USED) {
+      throw new UOE(
+          "Expected parameter[%s] cannot exceed system set value of [%d]",
+          CTX_MAX_NUMERIC_IN_FILTERS,
+          systemConfigMaxNumericInFilters
+      );
+    }
+    // if system set value is not present, thereby inferring default of -1
+    if (systemConfigMaxNumericInFilters == NUM_FILTER_NOT_USED) {
+      return systemConfigMaxNumericInFilters;

Review comment:
       I think you want this to return the queryContextMaxNumericInFilters in this case. The reason I think this is a desired behavior is because it allows a sys admin to test what a reasonable default would be for the cluster by using the query context. Once the sys admin sets the maxNumericInFilters to some value, then having the system config be an upper limit makes sense. This should be documented in `configuration/index.md` and `querying/query-context.md`
   ```suggestion
         return queryContextMaxNumericInFilters;
   ```

##########
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##########
@@ -6903,6 +6904,67 @@ public void testMaxSubqueryRows() throws Exception
     );
   }
 
+  @Test
+  public void testZeroMaxNumericInFilter() throws Exception
+  {
+    expectedException.expect(UOE.class);
+    expectedException.expectMessage("[maxNumericInFilters] must be greater than 0");
+
+    testQuery(
+        PLANNER_CONFIG_DEFAULT,
+        ImmutableMap.of(QueryContexts.MAX_NUMERIC_IN_FILTERS, 0),

Review comment:
       Can we add a test for a non 0 query context with the planner config default to address my comment in PlannerConfig

##########
File path: sql/src/test/java/org/apache/druid/sql/calcite/CalciteQueryTest.java
##########
@@ -6903,6 +6904,67 @@ public void testMaxSubqueryRows() throws Exception
     );
   }
 
+  @Test
+  public void testZeroMaxNumericInFilter() throws Exception
+  {
+    expectedException.expect(UOE.class);
+    expectedException.expectMessage("[maxNumericInFilters] must be greater than 0");
+
+    testQuery(
+        PLANNER_CONFIG_DEFAULT,
+        ImmutableMap.of(QueryContexts.MAX_NUMERIC_IN_FILTERS, 0),
+        "SELECT COUNT(*)\n"
+        + "FROM druid.numfoo\n"
+        + "WHERE dim6 IN (\n"
+        + "1,2,3\n"
+        + ")\n",
+        CalciteTests.REGULAR_USER_AUTH_RESULT,
+        ImmutableList.of(),
+        ImmutableList.of()
+    );
+  }
+
+  @Test
+  public void testHighestMaxNumericInFilter() throws Exception
+  {
+    expectedException.expect(UOE.class);
+    expectedException.expectMessage("Expected parameter[maxNumericInFilters] cannot exceed system set value of [100]");
+
+    testQuery(
+        PLANNER_CONFIG_MAX_NUMERIC_IN_FILTER,
+        ImmutableMap.of(QueryContexts.MAX_NUMERIC_IN_FILTERS, 20000),
+        "SELECT COUNT(*)\n"
+        + "FROM druid.numfoo\n"
+        + "WHERE dim6 IN (\n"
+        + "1,2,3\n"
+        + ")\n",
+        CalciteTests.REGULAR_USER_AUTH_RESULT,
+        ImmutableList.of(),
+        ImmutableList.of()
+    );
+  }
+
+  @Test
+  public void testQueryWithMoreThanMaxNumericInFilter() throws Exception
+  {
+    expectedException.expect(UOE.class);
+    expectedException.expectMessage("The number of values in the IN clause for [dim6] in query exceeds configured maxNumericFilter limit of [2] for INs. Cast [3] values of IN clause to String");
+
+    testQuery(
+        PLANNER_CONFIG_MAX_NUMERIC_IN_FILTER,
+        ImmutableMap.of(QueryContexts.MAX_NUMERIC_IN_FILTERS, 2),
+        "SELECT COUNT(*)\n"
+        + "FROM druid.numfoo\n"
+        + "WHERE dim6 IN (\n"

Review comment:
       I think we'll want some additional tests here for more complex filters - Some examples:
   * ORs of multiple dimensions eg. something like dim1 = 1 OR dim2 =2 or dim3 = 3
   * OR filter several levels deep eg. dim1='one' AND dim2 IN (1,2,3,4,5)
   * NOT IN filter eg. `dim6 NOT IN (...)`

##########
File path: sql/src/main/java/org/apache/druid/sql/calcite/run/NativeQueryMaker.java
##########
@@ -122,6 +127,32 @@ public boolean feature(QueryFeature feature)
         );
       }
     }
+    int numFilters = plannerContext.getPlannerConfig().getMaxNumericInFilters();
+
+    // special corner case handling for numeric IN filters
+    // in case of query containing IN (v1, v2, v3,...) where Vi is numeric
+    // a BoundFilter is created internally for each of the values
+    // whereas when Vi s are String the Filters are converted as BoundFilter to SelectorFilter to InFilter
+    // which takes lesser processing for bitmaps
+    // So in a case where user executes a query with multiple numeric INs, flame graph shows BoundFilter.getBitmapResult
+    // and BoundFilter.match predicate eating up processing time which stalls a historical for a query with large number
+    // of numeric INs (> 10K). In such cases user should change the query to specify the IN clauses as String
+    // Instead of IN(v1,v2,v3) user should specify IN('v1','v2','v3')
+    if (numFilters != PlannerConfig.NUM_FILTER_NOT_USED) {
+      if (query.getFilter() instanceof OrDimFilter) {

Review comment:
       This looks like it would only supports filtering on a top level OR filter. However if a filter on a query is nested, then this guardrail would not be effective. For example what if the filter was 
   
   ```
   dim1='one' AND dim2 IN (1,2,3,4,5)
   ```
   
   In this case, it looks like the query would still run.
   
   Do you think walking the entire filter tree would be expensive at this point in the query plan? The other thing I'd like to know more about is how `DimFilter#optimize` plays into this guardrail - is the query filter that is being used here before or after the optimize function is called? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org