You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/02/26 21:34:38 UTC

[GitHub] [druid] sascha-coenen opened a new issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format

sascha-coenen opened a new issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format
URL: https://github.com/apache/druid/issues/9412
 
 
   ### Affected Version
   v0.17.0
   This is a serious regression bug for us. Queries like the following used to work in Druid 0.16.0 and are broken in 0.17.0
   We used Druid 0.17.0 with SQL compatible null handling DISABLED.
   
   ### Description
   
   The following query produces a rule error in the SQL engine:
   ```
   SELECT 
   	CASE cityName WHEN NULL THEN FALSE ELSE TRUE END AS col_a,
   	CASE countryName WHEN NULL THEN FALSE ELSE TRUE END AS col_b
   FROM wikipedia
   GROUP BY 1, 2
   ```
   
   the error is:
   
   ```
   Unknown exception / Error while applying rule DruidQueryRule(AGGREGATE), args [rel#592:LogicalAggregate.NONE.[](input=RelSubset#591,group={0, 1}), rel#599:DruidQueryRel.NONE.[](query={"queryType":"scan","dataSource":{"type":"table","name":"wikipedia"},"intervals":{"type":"intervals","intervals":["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"]},"virtualColumns":[{"type":"expression","name":"v0","expression":"1","outputType":"LONG"}],"resultFormat":"compactedList","batchSize":20480,"limit":9223372036854775807,"order":"none","filter":null,"columns":["v0"],"legacy":false,"context":{"forceLimitPushDown":"false","sqlOuterLimit":100,"sqlQueryId":"f83cf8c0-1856-4dc9-aeb6-fc487b92750f","useApproximateTopN":"false","vectorize":"false"},"descending":false,"granularity":{"type":"all"}},signature={v0:LONG, v0:LONG})] / java.lang.RuntimeException
   ```
   
   Following is a comparison between what a similar query got translated to in Druid 0.16.0 vs now
   
   in 0.16.0:
   ```
   {
     "queryType": "groupBy",
   ...
     "virtualColumns": [
       {
         "type": "expression",
         "name": "v0",
         "expression": "case_searched((\"integrationType\" == null),0,1)",
         "outputType": "LONG"
       },
       {
         "type": "expression",
         "name": "v1",
         "expression": "case_searched((\"integrationVersion\" == null),0,1)",
         "outputType": "LONG"
       }
     ],
     "dimensions": [
       {
         "type": "default",
         "dimension": "v0",
         "outputName": "v0",
         "outputType": "LONG"
       },
       {
         "type": "default",
         "dimension": "v1",
         "outputName": "v1",
         "outputType": "LONG"
       }
     ],
   ...
   }
   ```
   
   
   In v 0.17.0, there seem to be several things that compound:
   a) the query is rewritten into a scan query although it augh to be a groupby query
   b) the main issue seems to be that instead of recognizing the two CASE statements as being different and generating different logical names "v0" and "v1" for them, the two statements are being given the same name "v0" which is then consecutively leading to an illegal query syntax.
   c) instead of faithful CASE expressions, a virtual column expression gets generated that is just a constant number "1" 
   
   ```
   {
      "queryType":"scan",
   ...
      "virtualColumns":[
         {
            "type":"expression",
            "name":"v0",
            "expression":"1",
            "outputType":"LONG"
         }
      ],
      "columns":[
         "v0"
      ],
   ...
   }
   ```
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sascha-coenen commented on issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format

Posted by GitBox <gi...@apache.org>.
sascha-coenen commented on issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format
URL: https://github.com/apache/druid/issues/9412#issuecomment-591923866
 
 
   AH, right! 
   Thanks now I get it. I remember that NULL IS NULL evaluates to true while NULL = NULL does not and it seems that CASE X WHEN NULL THEN would apply an equals comparison and therefore not evaluate to true ever.
   
   I'm sorry but to round this off one further set of example queries that are using the correct syntax are needed:
   
   the following query works:
   ```
   SELECT 
   	CASE WHEN cityName IS NULL THEN FALSE ELSE TRUE END AS col_a,
   	CASE WHEN countryName IS NULL THEN FALSE ELSE TRUE END AS col_b
   FROM wikipedia
   GROUP BY 1, 2
   ```
   
   the following query fails. (Duplicate field name: v0)
   The only difference is the wrapping of the fields inside the REGEXP_EXTRACT function. 
   
   ```
   SELECT 
   	CASE WHEN REGEXP_EXTRACT(cityName, '.a.*') IS NULL THEN FALSE ELSE TRUE END AS col_a,
   	CASE WHEN REGEXP_EXTRACT(countryName, '.b.*') IS NULL THEN FALSE ELSE TRUE END AS col_b
   FROM wikipedia
   GROUP BY 1, 2
   
   ```
   
   This is just to have a perhaps best final test fixture that is working on the wikipedia dataset and is using the right CASE statement. 
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format

Posted by GitBox <gi...@apache.org>.
clintropolis commented on issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format
URL: https://github.com/apache/druid/issues/9412#issuecomment-591674647
 
 
   Thanks for the report, it looks like a couple of things are going on here.
   
   I think the new in 0.17 behavior is that _both_ of the expressions are being optimized into "true", my guess calcite optimizing something due to SQL lack of null equivalence means only the else is possible. I haven't confirmed yet, but I suspect #8566 might be related.
   
   Try this instead:
   
   ```
   SELECT 
   	CASE WHEN cityName IS NULL THEN FALSE ELSE TRUE END AS col_a,
   	CASE WHEN countryName IS NULL THEN FALSE ELSE TRUE END AS col_b
   FROM wikipedia
   GROUP BY 1, 2
   ```
    
   Anyway, the optimization calcite was doing uncovers another bug, which I think might be caused by the changes of #6902, which refactored druid sql to re-use virtual columns for the same expressions, but this appears to make grouping sad if both columns that are now the same virtual column are being grouped on.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis commented on issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format

Posted by GitBox <gi...@apache.org>.
clintropolis commented on issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format
URL: https://github.com/apache/druid/issues/9412#issuecomment-596848067
 
 
   Closing this as #9429 has been merged which fixes the bug discovered here.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm edited a comment on issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format

Posted by GitBox <gi...@apache.org>.
gianm edited a comment on issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format
URL: https://github.com/apache/druid/issues/9412#issuecomment-591689543
 
 
   To be clear, the query @clintropolis is the correct way to write what you're trying to write. Starting in Druid 0.17.0 it (correctly) recognizes that `CASE cityName WHEN NULL THEN FALSE ELSE TRUE END AS col_a` is always true, because `NULL` does not equal `NULL`, so even if `cityName` is NULL, it still won't go down the `WHEN NULL` path.
   
   PostgreSQL would do the same thing (I just tested it to double-check).
   
   There is a bug, though, where the SQL planner is trying to create two virtual columns with the same name. We'll need to fix that. But it shouldn't block you, @sascha-coenen, because you should adjust your query to the variant @clintropolis suggested anyway (it's better aligned with how SQL works, and should be effective in 0.17.0 and all future versions).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sascha-coenen commented on issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format

Posted by GitBox <gi...@apache.org>.
sascha-coenen commented on issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format
URL: https://github.com/apache/druid/issues/9412#issuecomment-591700246
 
 
   thanks for the feedback.
   wrt to the CASE cityName evaluating to NULL:
   The origiinal query that failed for us looks like this:
   
   ```
   SELECT 
   	TIME_FORMAT(__time, 'yyyy-MM') AS "period",
   	CASE REGEXP_EXTRACT(integrationVersion, 'sdkandroid 2[0-9]\..*') WHEN NULL THEN FALSE ELSE TRUE END AS is_nextgen_android,
   	CASE REGEXP_EXTRACT(integrationVersion, 'sdkios 2[0-9]\..*') WHEN NULL THEN FALSE ELSE TRUE END AS is_nextgen_ios,
   	integrationType AS sdk,
   	integrationVersion AS sdk_version,
   	ROUND(SUM(accountableRevenue)) AS revenue,
   	ROUND(SUM("count")) AS soma_services_performed
   FROM "supply-activities-inv"
   WHERE __time >= '2020-01-01 00:00:00'
   	AND __time < '2020-02-01 00:00:00'
   	AND multiplierId NOT IN ('2000', '200')
   	AND activityTypeId IN ('o', 'u', 'p', 'f')
   	AND integrationType IN ('sdkandroid', 'sdkios', 'ubsdkandroid', 'ubsdkios')
   GROUP BY 1, 2, 3, 4, 5
   ```
   
   This query also fails with the same exception, namely that v0 is being used twice, but in the above CASE statements, I believe that calcite cannot infer that the CASE condition is always null.
   I was trying to simplify this query to the point where I would be able to pinpoint what's going on.
   
   Interesitngly, it seems to also have to do with the ELSE clause being the same. For instance consider the followiing two queries. The first will fail the second will work:
   
   
   ```
   SELECT 
   	CASE cityName WHEN NULL THEN FALSE ELSE TRUE END AS col_a,
   	CASE countryName WHEN NULL THEN FALSE ELSE TRUE END AS col_b
   FROM wikipedia
   GROUP BY 1, 2
   ```
   FAILS
   
   ```
   SELECT 
   	CASE cityName WHEN NULL THEN FALSE ELSE TRUE END AS col_a,
   	CASE countryName WHEN NULL THEN TRUE ELSE FALSE END AS col_b
   FROM wikipedia
   GROUP BY 1, 2
   ```
   WORKS!
   
   The only difference is that the TRUE/FALSE in the second case statement is swapped. Now that the two projections don't both evaluate to the same value, the query works because I guess it cannot reduce the expressions down to a common one.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format
URL: https://github.com/apache/druid/issues/9412#issuecomment-592074518
 
 
   > This is just to have a perhaps best final test fixture that is working on the wikipedia dataset and is using the right CASE statement.
   
   Thanks, that's helpful!
   
   It looks like @clintropolis is already working on a fix in #9429. I'm going to go take a look at that patch.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] clintropolis closed issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format

Posted by GitBox <gi...@apache.org>.
clintropolis closed issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format
URL: https://github.com/apache/druid/issues/9412
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format
URL: https://github.com/apache/druid/issues/9412#issuecomment-591689543
 
 
   To be clear, the query @clintropolis is the correct way to write what you're trying to write. Starting in Druid 0.17.0 it (correctly) recognizes that `CASE cityName WHEN NULL THEN FALSE ELSE TRUE END AS col_a` is always true, because `NULL` does not equal `NULL`, so even if `cityName` is NULL, it still shouldn't go down the `WHEN NULL` path.
   
   PostgreSQL would do the same thing (I just tested it to double-check).
   
   There is a bug, though, where the SQL planner is trying to create two virtual columns with the same name. We'll need to fix that. But it shouldn't block you, @sascha-coenen, because you should adjust your query to the variant @clintropolis suggested anyway (it's better aligned with how SQL works, and should be effective in 0.17.0 and all future versions).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #9412: Regression: SQL query with multiple similar CASE clauses cannot be translated to correct native json format
URL: https://github.com/apache/druid/issues/9412#issuecomment-591702177
 
 
   I think you're right that them reducing to the same expression has something to do with the error. Fwiw, in the SQL standard, `CASE X WHEN NULL` is _always_ false for _any_ X. Calcite detects that and just replaces it with the `ELSE` path (it doesn't care what X is, nor should it).
   
   For your case, you should just replace `CASE X WHEN NULL THEN ... ELSE ... END` with `CASE WHEN X IS NULL THEN ... ELSE ... END`, and then you should get the behavior you're looking for.
   
   Separately, we need to fix this bug, but when we do, it will just mean that the `ELSE` path is (correctly) always taken.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org