You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/02/26 22:00:54 UTC

[GitHub] [druid] sascha-coenen opened a new issue #9414: SQL: severe performance degradation if multi-valued dimension gets combined with a LOOKUP and NVL function

sascha-coenen opened a new issue #9414: SQL: severe performance degradation if multi-valued dimension gets combined with a LOOKUP and NVL function
URL: https://github.com/apache/druid/issues/9414
 
 
   ### Affected Version
   0.16.0 and 0.17.0
   
   ### Description
   We often use the combination of NULLIF() and LOOKUP() functions in combination. This works fine for normal dimension columns, but when using a multi-valued dimension, then the following combination of NULLIF and LOOKUP leads to extremely high segment scan times of 30 seconds or even up to several minutes:
   
   ```
   SELECT
      NULLIF(LOOKUP(adTypeIdPubRequest, 'adtype_by_adtypeid'), 'fallback') AS adFormat,
      SUM("count") AS cnt
   FROM "supply-activities" 
   WHERE  (__time >= timestamp'2020-01-20 00:00:00' AND __time < timestamp '2020-01-20 01:00:00')
   GROUP BY 1
   ORDER BY 2
   LIMIT 10
   ```
   
   In contrast, the following query which is merely dropping the surrounding NULLIF() function is executing at normal speed:
   
   ```
   SELECT
      LOOKUP(adTypeIdPubRequest, 'adtype_by_adtypeid') AS adFormat,
      SUM("count") AS cnt
   FROM "supply-activities" 
   WHERE  (__time >= timestamp'2020-01-20 00:00:00' AND __time < timestamp '2020-01-20 01:00:00')
   GROUP BY 1
   ORDER BY 2
   ```
   
   Using NULLIF in combination with LOOKUP on normal dinensions does NOT lead to a performance issue.
   Prior to Druid 0.16.0 we did not notice any performance issue but cannot guarentee that it was introduced with 0.16 because we might have simply overlooked it in prior releases.
   With 0.16.0 and also 0.17.0 the performance issue exists. We tested with SQL compatible null handling turned on and off. Performance is bad in both cases. As mentioned, the segment scan times are off the charts.
   
   --
   
   I could not demonstrate the issue using the wikipedia dataset because it doesn't seem to contain any multi-valued dimensions. Perhaps it would be good to have a normative dataset which showcases all field types such that it can be used for problem reports that can be reproduced?
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #9414: SQL: severe performance degradation if multi-valued dimension gets combined with a LOOKUP and NVL function

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #9414: SQL: severe performance degradation if multi-valued dimension gets combined with a LOOKUP and NVL function
URL: https://github.com/apache/druid/issues/9414#issuecomment-596667132
 
 
   This should be fixed by #9425 (without need for the mentioned workaround).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #9414: SQL: severe performance degradation if multi-valued dimension gets combined with a LOOKUP and NVL function

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #9414: SQL: severe performance degradation if multi-valued dimension gets combined with a LOOKUP and NVL function
URL: https://github.com/apache/druid/issues/9414#issuecomment-591681808
 
 
   There is an important optimization that gets applied for string expressions that read just one column, in which expression evaluation is deferred until after the segment scan. I'm not sure why, but in your case, it's being applied for `LOOKUP(...)` and not for `NULLIF(LOOKUP(...), ...)`.
   
   It probably has to do with the fact that the column is referenced twice and the fact that the input is multi-value (this SQL expression will become `case_searched((lookup("adTypeIdPubRequest",'adtype_by_adtypeid') == 'fallback'),null,lookup("adTypeIdPubRequest",'adtype_by_adtypeid'))` as a native expression). The logic for performing the optimization might be getting confused.
   
   Try this workaround:
   
   ```
   SELECT
      NULLIF(LOOKUP(adTypeIdPubRequest, 'adtype_by_adtypeid'), 'fallback') AS adFormat,
      SUM("count") AS cnt
   FROM "supply-activities" 
   WHERE  (__time >= timestamp'2020-01-20 00:00:00' AND __time < timestamp '2020-01-20 01:00:00')
   GROUP BY LOOKUP(adTypeIdPubRequest, 'adtype_by_adtypeid')
   ORDER BY 2
   LIMIT 10
   ```
   
   It's not exactly an identical query, but it will run faster, and might be close enough to what you want.
   
   Btw, I think that this was always an issue for expressions on multi-value columns as long as they've worked at all (they used to just always return null). So I think it's not a regression.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sascha-coenen commented on issue #9414: SQL: severe performance degradation if multi-valued dimension gets combined with a LOOKUP and NVL function

Posted by GitBox <gi...@apache.org>.
sascha-coenen commented on issue #9414: SQL: severe performance degradation if multi-valued dimension gets combined with a LOOKUP and NVL function
URL: https://github.com/apache/druid/issues/9414#issuecomment-591704834
 
 
   Thanks for the explanations and the support.
   Just to confirm: the workaround you suggested is indeed executing very well. Thanks a ton!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm closed issue #9414: SQL: severe performance degradation if multi-valued dimension gets combined with a LOOKUP and NVL function

Posted by GitBox <gi...@apache.org>.
gianm closed issue #9414: SQL: severe performance degradation if multi-valued dimension gets combined with a LOOKUP and NVL function
URL: https://github.com/apache/druid/issues/9414
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #9414: SQL: severe performance degradation if multi-valued dimension gets combined with a LOOKUP and NVL function

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #9414: SQL: severe performance degradation if multi-valued dimension gets combined with a LOOKUP and NVL function
URL: https://github.com/apache/druid/issues/9414#issuecomment-591685291
 
 
   It looks like this is a bug in the check for when it's ok to apply this optimization. It's being too strict and not applying it, even though it could.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sascha-coenen commented on issue #9414: SQL: severe performance degradation if multi-valued dimension gets combined with a LOOKUP and NVL function

Posted by GitBox <gi...@apache.org>.
sascha-coenen commented on issue #9414: SQL: severe performance degradation if multi-valued dimension gets combined with a LOOKUP and NVL function
URL: https://github.com/apache/druid/issues/9414#issuecomment-596845482
 
 
   This is AWESOME. Thanks a lot to everyone for fixing this so quickly.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org