You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/02/26 21:48:04 UTC

[GitHub] [druid] sascha-coenen opened a new issue #9413: SQL: BLOOM_FILTER() does not work on constant values fed in via sub query

sascha-coenen opened a new issue #9413: SQL: BLOOM_FILTER() does not work on constant values fed in via sub query
URL: https://github.com/apache/druid/issues/9413
 
 
   ### Affected Version
   0.16.0 and 0.17.0
   
   ### Description
   Sometimes users need to feed large lists of tokens to an IN filter which slows down queries, so I was trying to rewrite such queries so that a bloom filter would capture the filter conditions.
   I found out that the bloom_filter() function only seems to work if applied directly to a table column.
   However, if constant values from a subquery get passed in, the SQL query cannot be transformed into a native JSON query due to rule exceptions.
   
   The following set of queries demonstrate at which point things break:
   
   `VALUES (1,2,3), (4,5,6), (7,8,9)`
   works
   
   `VALUES 'a', 'b', 'c'`
   works
   
   `SELECT blacklist.item FROM (VALUES 'a', 'b', 'c') AS blacklist (item)`
   works
   
   `SELECT COUNT(blacklist.item) FROM (VALUES 'a', 'b', 'c') AS blacklist (item)`
   works
   
   `SELECT BLOOM_FILTER(blacklist.item, 100) FROM (VALUES 'a', 'b', 'c') AS blacklist (item)`
   fails with the exception below.
   This is stange because the above query succeeds although the only difference is that another aggregation function got used, namely COUNT instead of BLOOM_FILTER but the structure of both queries is identical.
   
   The following works fine:
   
   `SELECT BLOOM_FILTER(CountryCode, 100) FROM revenue_statistics WHERE __time > TIMESTAMP '2019-01-01 00:00:00'`
   works
   
   exception:
   `SQL Error [00000]: Error -1 (00000) : Error while executing SQL "SELECT BLOOM_FILTER(blacklist.item, 2) FROM (VALUES 'a', 'b', 'c') AS blacklist (item)": Remote driver error: RuntimeException: org.apache.calcite.plan.RelOptPlanner$CannotPlanException: There are not enough rules to produce a node with desired properties: convention=DRUID, sort=[].¶Missing conversion is LogicalAggregate[convention: NONE -> DRUID]¶There is 1 empty subset: rel#519:Subset#2.DRUID.[], the relevant part of the original plan is as follows¶517:LogicalAggregate(group=[{}], EXPR$0=[BLOOM_FILTER($0, $1)])¶  515:LogicalProject(subset=[rel#516:Subset#1.NONE.[0]], item=[$0], $f1=[2])¶    506:LogicalValues(subset=[rel#514:Subset#0.NONE.[0]], tuples=[[{ 'a' }, { 'b' }, { 'c' }]])¶¶Root: rel#519:Subset#2.DRUID.[]¶Original rel:¶LogicalAggregate(subset=[rel#519:Subset#2.DRUID.[]], group=[{}], EXPR$0=[BLOOM_FILTER($0, $1)]): rowcount = 1.0, cumulative cost = {1.125 rows, 0.0 cpu, 0.0 io}, id = 517¶  LogicalProject(subset=[rel#516:Subset#1.NONE.[0]], item=[$0], $f1=[2]): rowcount = 3.0, cumulative cost = {3.0 rows, 6.0 cpu, 0.0 io}, id = 515¶    LogicalValues(subset=[rel#514:Subset#0.NONE.[0]], tuples=[[{ 'a' }, { 'b' }, { 'c' }]]): rowcount = 3.0, cumulative cost = {3.0 rows, 1.0 cpu, 0.0 io}, id = 506¶¶Sets:¶Set#0, type: RecordType(CHAR(1) item)¶ rel#514:Subset#0.NONE.[0], best=null, importance=0.7290000000000001¶  rel#506:LogicalValues.NONE.[0](type=RecordType(CHAR(1) item),tuples=[{ 'a' }, { 'b' }, { 'c' }]), rowcount=3.0, cumulative cost={inf}¶Set#1, type: RecordType(CHAR(1) item, INTEGER $f1)¶ rel#516:Subset#1.NONE.[0], best=null, importance=0.81¶  rel#515:LogicalProject.NONE.[0](input=RelSubset#514,item=$0,$f1=2), rowcount=3.0, cumulative cost={inf}¶  rel#521:LogicalValues.NONE.[[0, 1], [1]](type=RecordType(CHAR(1) item, INTEGER $f1),tuples=[{ 'a', 2 }, { 'b', 2 }, { 'c', 2 }]), rowcount=3.0, cumulative cost={inf}¶ rel#522:Subset#1.NONE.[], best=null, importance=0.405¶  rel#515:LogicalProject.NONE.[0](input=RelSubset#514,item=$0,$f1=2), rowcount=3.0, cumulative cost={inf}¶  rel#521:LogicalValues.NONE.[[0, 1], [1]](type=RecordType(CHAR(1) item, INTEGER $f1),tuples=[{ 'a', 2 }, { 'b', 2 }, { 'c', 2 }]), rowcount=3.0, cumulative cost={inf}¶Set#2, type: RecordType(OTHER EXPR$0)¶ rel#518:Subset#2.NONE.[], best=null, importance=0.9¶  rel#517:LogicalAggregate.NONE.[](input=RelSubset#516,group={},EXPR$0=BLOOM_FILTER($0, $1)), rowcount=1.0, cumulative cost={inf}¶ rel#519:Subset#2.DRUID.[], best=null, importance=1.0¶  rel#520:AbstractConverter.DRUID.[](input=RelSubset#518,convention=DRUID,sort=[]), rowcount=1.0, cumulative cost={inf}¶¶Graphviz:¶digraph G {¶ root [style=filled,label="Root"];¶ subgraph cluster0{¶  label="Set 0 RecordType(CHAR(1) item)";¶  rel506 [label="rel#506:LogicalValues\ntype=RecordType(CHAR(1) item),tuples=[{ 'a' }, { 'b' }, { 'c' }]\nrows=3.0, cost={inf}",shape=box]¶  subset514 [label="rel#514:Subset#0.NONE.[0]"]¶ }¶ subgraph cluster1{¶  label="Set 1 RecordType(CHAR(1) item, INTEGER $f1)";¶  rel515 [label="rel#515:LogicalProject\ninput=RelSubset#514,item=$0,$f1=2\nrows=3.0, cost={inf}",shape=box]¶  rel521 [label="rel#521:LogicalValues.NONE.[[0, 1], [1]]\ntype=RecordType(CHAR(1) item, INTEGER $f1),tuples=[{ 'a', 2 }, { 'b', 2 }, { 'c', 2 }]\nrows=3.0, cost={inf}",shape=box]¶  subset516 [label="rel#516:Subset#1.NONE.[0]"]¶  subset522 [label="rel#522:Subset#1.NONE.[]"]¶  subset522 -> subset516; }¶ subgraph cluster2{¶  label="Set 2 RecordType(OTHER EXPR$0)";¶  rel517 [label="rel#517:LogicalAggregate\ninput=RelSubset#516,group={},EXPR$0=BLOOM_FILTER($0, $1)\nrows=1.0, cost={inf}",shape=box]¶  rel520 [label="rel#520:AbstractConverter\ninput=RelSubset#518,convention=DRUID,sort=[]\nrows=1.0, cost={inf}",shape=box]¶  subset518 [label="rel#518:Subset#2.NONE.[]"]¶  subset519 [label="rel#519:Subset#2.DRUID.[]",color=red]¶ }¶ root -> subset519;¶ subset514 -> rel506;¶ subset516 -> rel515; rel515 -> subset514;¶ subset522 -> rel521;¶ subset518 -> rel517; rel517 -> subset516;¶ subset519 -> rel520; rel520 -> subset518;¶} -> CannotPlanException: There are not enough rules to produce a node with desired properties: convention=DRUID, sort=[].¶Missing conversion is LogicalAggregate[convention: NONE -> DRUID]¶There is 1 empty subset: rel#519:Subset#2.DRUID.[], the relevant part of the original plan is as follows¶517:LogicalAggregate(group=[{}], EXPR$0=[BLOOM_FILTER($0, $1)])¶  515:LogicalProject(subset=[rel#516:Subset#1.NONE.[0]], item=[$0], $f1=[2])¶    506:LogicalValues(subset=[rel#514:Subset#0.NONE.[0]], tuples=[[{ 'a' }, { 'b' }, { 'c' }]])¶¶Root: rel#519:Subset#2.DRUID.[]¶Original rel:¶LogicalAggregate(subset=[rel#519:Subset#2.DRUID.[]], group=[{}], EXPR$0=[BLOOM_FILTER($0, $1)]): rowcount = 1.0, cumulative cost = {1.125 rows, 0.0 cpu, 0.0 io}, id = 517¶  LogicalProject(subset=[rel#516:Subset#1.NONE.[0]], item=[$0], $f1=[2]): rowcount = 3.0, cumulative cost = {3.0 rows, 6.0 cpu, 0.0 io}, id = 515¶    LogicalValues(subset=[rel#514:Subset#0.NONE.[0]], tuples=[[{ 'a' }, { 'b' }, { 'c' }]]): rowcount = 3.0, cumulative cost = {3.0 rows, 1.0 cpu, 0.0 io}, id = 506¶¶Sets:¶Set#0, type: RecordType(CHAR(1) item)¶ rel#514:Subset#0.NONE.[0], best=null, importance=0.7290000000000001¶  rel#506:LogicalValues.NONE.[0](type=RecordType(CHAR(1) item),tuples=[{ 'a' }, { 'b' }, { 'c' }]), rowcount=3.0, cumulative cost={inf}¶Set#1, type: RecordType(CHAR(1) item, INTEGER $f1)¶ rel#516:Subset#1.NONE.[0], best=null, importance=0.81¶  rel#515:LogicalProject.NONE.[0](input=RelSubset#514,item=$0,$f1=2), rowcount=3.0, cumulative cost={inf}¶  rel#521:LogicalValues.NONE.[[0, 1], [1]](type=RecordType(CHAR(1) item, INTEGER $f1),tuples=[{ 'a', 2 }, { 'b', 2 }, { 'c', 2 }]), rowcount=3.0, cumulative cost={inf}¶ rel#522:Subset#1.NONE.[], best=null, importance=0.405¶  rel#515:LogicalProject.NONE.[0](input=RelSubset#514,item=$0,$f1=2), rowcount=3.0, cumulative cost={inf}¶  rel#521:LogicalValues.NONE.[[0, 1], [1]](type=RecordType(CHAR(1) item, INTEGER $f1),tuples=[{ 'a', 2 }, { 'b', 2 }, { 'c', 2 }]), rowcount=3.0, cumulative cost={inf}¶Set#2, type: RecordType(OTHER EXPR$0)¶ rel#518:Subset#2.NONE.[], best=null, importance=0.9¶  rel#517:LogicalAggregate.NONE.[](input=RelSubset#516,group={},EXPR$0=BLOOM_FILTER($0, $1)), rowcount=1.0, cumulative cost={inf}¶ rel#519:Subset#2.DRUID.[], best=null, importance=1.0¶  rel#520:AbstractConverter.DRUID.[](input=RelSubset#518,convention=DRUID,sort=[]), rowcount=1.0, cumulative cost={inf}¶¶Graphviz:¶digraph G {¶ root [style=filled,label="Root"];¶ subgraph cluster0{¶  label="Set 0 RecordType(CHAR(1) item)";¶  rel506 [label="rel#506:LogicalValues\ntype=RecordType(CHAR(1) item),tuples=[{ 'a' }, { 'b' }, { 'c' }]\nrows=3.0, cost={inf}",shape=box]¶  subset514 [label="rel#514:Subset#0.NONE.[0]"]¶ }¶ subgraph cluster1{¶  label="Set 1 RecordType(CHAR(1) item, INTEGER $f1)";¶  rel515 [label="rel#515:LogicalProject\ninput=RelSubset#514,item=$0,$f1=2\nrows=3.0, cost={inf}",shape=box]¶  rel521 [label="rel#521:LogicalValues.NONE.[[0, 1], [1]]\ntype=RecordType(CHAR(1) item, INTEGER $f1),tuples=[{ 'a', 2 }, { 'b', 2 }, { 'c', 2 }]\nrows=3.0, cost={inf}",shape=box]¶  subset516 [label="rel#516:Subset#1.NONE.[0]"]¶  subset522 [label="rel#522:Subset#1.NONE.[]"]¶  subset522 -> subset516; }¶ subgraph cluster2{¶  label="Set 2 RecordType(OTHER EXPR$0)";¶  rel517 [label="rel#517:LogicalAggregate\ninput=RelSubset#516,group={},EXPR$0=BLOOM_FILTER($0, $1)\nrows=1.0, cost={inf}",shape=box]¶  rel520 [label="rel#520:AbstractConverter\ninput=RelSubset#518,convention=DRUID,sort=[]\nrows=1.0, cost={inf}",shape=box]¶  subset518 [label="rel#518:Subset#2.NONE.[]"]¶  subset519 [label="rel#519:Subset#2.DRUID.[]",color=red]¶ }¶ root -> subset519;¶ subset514 -> rel506;¶ subset516 -> rel515; rel515 -> subset514;¶ subset522 -> rel521;¶ subset518 -> rel517; rel517 -> subset516;¶ subset519 -> rel520; rel520 -> subset518;¶}`
   
   
   
   
   
   
   
   
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #9413: SQL: BLOOM_FILTER() does not work on constant values fed in via sub query

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #9413: SQL: BLOOM_FILTER() does not work on constant values fed in via sub query
URL: https://github.com/apache/druid/issues/9413#issuecomment-591675746
 
 
   Treating this as a feature request for `FROM (VALUES ...)` support in the SQL-to-native-Druid-query logic. I think that would make @sascha-coenen's query work out.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sascha-coenen commented on issue #9413: SQL: BLOOM_FILTER() does not work on constant values fed in via sub query

Posted by GitBox <gi...@apache.org>.
sascha-coenen commented on issue #9413: SQL: BLOOM_FILTER() does not work on constant values fed in via sub query
URL: https://github.com/apache/druid/issues/9413#issuecomment-591703412
 
 
   thanks for the explanations.
   
   What I did not understand is that Druid can't handle FROM (VALUES ...) yet.
   It seems to work at least partially as the following query works fine:
   `
   SELECT COUNT(blacklist.item) FROM (VALUES 'a', 'b', 'c') AS blacklist (item)
   `
   
   > meaning all Druid SQL queries will actually run as Druid native queries
   sounds definitely interesting although I'm not sure whether I understand correctly.
   Does it mean that historicals would support receiving SQL queries directly and being able to execute them without having to rewrite them to JSON queries first? And that can be done without calcite? wow

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #9413: SQL: BLOOM_FILTER() does not work on constant values fed in via sub query

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #9413: SQL: BLOOM_FILTER() does not work on constant values fed in via sub query
URL: https://github.com/apache/druid/issues/9413#issuecomment-591705590
 
 
   > sounds definitely interesting although I'm not sure whether I understand correctly.
   > Does it mean that historicals would support receiving SQL queries directly and being able to execute them without having to rewrite them to JSON queries first? And that can be done without calcite? wow
   
   What I meant was that today there are two ways that Druid SQL queries can run:
   
   - If they refer to Druid datasources then they will be translated to native Druid queries and run that way.
   - If they do not refer to Druid datasources, they run using Calcite's interpreter. This includes any queries on `sys` or `INFORMATION_SCHEMA` tables, and any queries on raw `VALUES`.
   
   It is a bit weird to have two paths like this, and we'd like to standardize on always translating SQL to native Druid queries (the JSON ones). This means native Druid queries need to start supporting functionality like `sys`, `INFORMATION_SCHEMA`, `VALUES`, etc. It is coming but we aren't quite there yet.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sascha-coenen edited a comment on issue #9413: SQL: BLOOM_FILTER() does not work on constant values fed in via sub query

Posted by GitBox <gi...@apache.org>.
sascha-coenen edited a comment on issue #9413: SQL: BLOOM_FILTER() does not work on constant values fed in via sub query
URL: https://github.com/apache/druid/issues/9413#issuecomment-591703412
 
 
   thanks for the explanations.
   
   What I did not understand is that Druid can't handle FROM (VALUES ...) yet.
   It seems to work at least partially as the following query works fine:
   `
   SELECT COUNT(blacklist.item) FROM (VALUES 'a', 'b', 'c') AS blacklist (item)
   `
   
   > meaning all Druid SQL queries will actually run as Druid native queries
   
   sounds definitely interesting although I'm not sure whether I understand correctly.
   Does it mean that historicals would support receiving SQL queries directly and being able to execute them without having to rewrite them to JSON queries first? And that can be done without calcite? wow

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #9413: SQL: BLOOM_FILTER() does not work on constant values fed in via sub query

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #9413: SQL: BLOOM_FILTER() does not work on constant values fed in via sub query
URL: https://github.com/apache/druid/issues/9413#issuecomment-591675356
 
 
   I think the real issue here is that the SQL-to-native-Druid-query logic currently can't handle `FROM (VALUES ...)`. I bet it works in the other cases because you aren't referencing any Druid tables or Druid functions, so the SQL is running in Calcite's interpreter rather than as an actual Druid query.
   
   Btw, in a future release, I'd like to remove our usage of Calcite's interpreter, meaning all Druid SQL queries will actually run as Druid native queries. This will make things more coherent.
   
   Separately, it would also be good for us to add `FROM (VALUES ...)` support.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #9413: SQL: BLOOM_FILTER() does not work on constant values fed in via sub query

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #9413: SQL: BLOOM_FILTER() does not work on constant values fed in via sub query
URL: https://github.com/apache/druid/issues/9413#issuecomment-591676224
 
 
   Btw, as a workaround, you could generate the bloom filter yourself and pass it to `BLOOM_FILTER_TEST`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org