You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/02/26 22:51:00 UTC

[GitHub] [druid] sascha-coenen opened a new issue #9419: SQL: top-level sketch functions do not return sketches

sascha-coenen opened a new issue #9419: SQL: top-level sketch functions do not return sketches
URL: https://github.com/apache/druid/issues/9419
 
 
   ### Affected Version
   0.17.0
   
   ### Description
   The following query should be returning materialized sketches in the form of base64 encoded strings such that the results can be exported, forklifted and later on used in other contexts.
   The Druid documentation states that the following three functions would return sketches, but if they are used at top-level as in the following example, they either return counts or unique counts but no sketches:
   
   ```
   SELECT
   	DS_HLL(countryName),
   	DS_THETA(countryName),
   	DS_QUANTILES_SKETCH(countryName)
   FROM wikipedia
   ```
   This yields 
   105, 105.0, 2383
   
   The documentation states:
   
   > DS_HLL(expr, [lgK, tgtHllType]) | Creates an HLL sketch on the values of expr
   
   > DS_THETA(expr, [size]) | Creates a Theta sketch on the values of expr
   
   > DS_QUANTILES_SKETCH(expr, [k]) | Creates a Quantiles sketch on the values of expr
   
   
   
   
   
   
   
   
   
   
   
   
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sascha-coenen commented on issue #9419: SQL: top-level sketch functions do not return sketches

Posted by GitBox <gi...@apache.org>.
sascha-coenen commented on issue #9419: SQL: top-level sketch functions do not return sketches
URL: https://github.com/apache/druid/issues/9419#issuecomment-591953898
 
 
   I forgot to add another interesting detail to this bug report, so I will update the description

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sascha-coenen edited a comment on issue #9419: SQL: top-level sketch functions do not return sketches

Posted by GitBox <gi...@apache.org>.
sascha-coenen edited a comment on issue #9419: SQL: top-level sketch functions do not return sketches
URL: https://github.com/apache/druid/issues/9419#issuecomment-592413853
 
 
   > @vikramsinghchandel You can set the same context options in the JDBC connection properties: https://druid.apache.org/docs/latest/querying/sql.html#connection-context
   
   hmm, Vikram already pointed out that he is aware of this. His need is to be able to express such things per query: 
   > Now both of these systems **don't have per query level context settings** **they are set at JDBC driver level.**
   
   --
   
   > the complex aggs like sketches are intended to be "finalized" (give the estimate) unless the user explicitly sets finalize=false in the context, so the bug would be that it's returning the base64 encodings with the additional projections
   
   I respectfully disagree:
   
   There are three dimensions to what makes a good user experience
   a) sensible defaults
   b) consistency
   b) control
   
   wrt **sensible defaults**, I agree with you that the common case is to want to materialize a sketch rather than to look at its raw form and therefore finalizing a sketch would be the most sensible default. 
   
   wrt **consistency** however, for all other sketches the current set of Druid SQL UDFs are designed around separating the responsibilities of creating a sketch and then evaluating it. It is also the common case to have to do this for the simple reason that - by default - there is no 1:1 relationship between aggregations and post aggregations. (i.e. should a doubleSum() be post-aggregated with a LOG10 or a SQRT ?  should a quantile sketch be post-aggregated with a union/intersect/not ?)
   So all sketches have UDFs that 
   
   a) generate a sketch 
   e.g. BLOOM_FILTER, TDIGEST_GENERATE_SKETCH, DS_QUANTILES_SKETCH, DS_HLL, DS_THETA
   
   b) evalutate a sketch
   e.g. BLOOM_FILTER_TEST, TDIGEST_QUANTILE,  DS_GET_QUANTILES, HLL_SKETCH_ESTIMATE, THETA_SKETCH_ESTIMATE 
   
   Notice how there is a full correspondence such that each generative UDF also has at leat one evaluating UDF.
   
   wrt **control**, having a sensible default is okay as long as this doesn't take away control from the author in terms of being able to express his intention and having it carried out. If it is my intention to pull unfinalized sketches in order to forklift them from Druid and feed them into another system, then I need to be able to formulate this. Setting a query context is not letting me do this because this is not usually possible on a per-query case and it is also not explicit within the query that other users need to be able to read and understand. I cannot send a query to another user and be sure that it does what I want because some parts of what the query result would look like are formulated outside of the query (namely in the query context). This is a huge usability problem and I'm looking forward to writing a proposal for expressing such query hints within the SQL queries.
   
   -- 
   
   Whether sensible defaults trump consistency is a difficult question to answer. 
   Luckily, this tie has already been broken in this context because the set of evaluating UDFs all work both on sketches and on regular fields, so that they act as shortcuts: If all I want is to create a sketch and then to finalize it, then I can directly use the above set of evaluating sketch functions.
   
   Putting all of this together, it seems that one just needs to follow the same concept for the given HLL/THETA sketches, which means that their generative UDFs aught to always return the sketches and the evaluating UDFs should always returned finalized results and can act as a shortcut to short-circuit the generative UDFS.
   
   The DS_HLL and DS_THETA functions just deviate from the common approach taken for the other sketches which also is reflected in their documentation which in my opinion correctly formulates the intention that these generative functions should yield a sketch and not a finalized result.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jon-wei commented on issue #9419: SQL: top-level sketch functions do not return sketches

Posted by GitBox <gi...@apache.org>.
jon-wei commented on issue #9419: SQL: top-level sketch functions do not return sketches
URL: https://github.com/apache/druid/issues/9419#issuecomment-591688426
 
 
   You can set `"finalize":false` in the query context if you want the serialized sketches instead of the estimates

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jon-wei edited a comment on issue #9419: SQL: top-level sketch functions do not return sketches

Posted by GitBox <gi...@apache.org>.
jon-wei edited a comment on issue #9419: SQL: top-level sketch functions do not return sketches
URL: https://github.com/apache/druid/issues/9419#issuecomment-591688426
 
 
   You can set `"finalize": false` in the query context (https://druid.apache.org/docs/latest/querying/query-context.html) if you want the serialized sketches instead of the estimates

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sascha-coenen commented on issue #9419: SQL: top-level sketch functions do not return sketches

Posted by GitBox <gi...@apache.org>.
sascha-coenen commented on issue #9419: SQL: top-level sketch functions do not return sketches
URL: https://github.com/apache/druid/issues/9419#issuecomment-592413853
 
 
   > @vikramsinghchandel You can set the same context options in the JDBC connection properties: https://druid.apache.org/docs/latest/querying/sql.html#connection-context
   
   hmm, Vikram already pointed out that he is aware of this. His need is to be able to express such things per query: 
   > Now both of these systems **don't have per query level context settings** **they are set at JDBC driver level.**
   
   
   > the complex aggs like sketches are intended to be "finalized" (give the estimate) unless the user explicitly sets finalize=false in the context, so the bug would be that it's returning the base64 encodings with the additional projections
   
   I respectfully disagree:
   
   There are three dimensions to what makes a good user experience
   a) sensible defaults
   b) consistency
   b) control
   
   wrt **sensible defaults**, I agree with you that the common case is to want to materialize a sketch rather than to look at its raw form and therefore finalizing a sketch would be the most sensible default. 
   
   wrt **consistency** however, for all other sketches the current set of Druid SQL UDFs are designed around separating the responsibilities of creating a sketch and then evaluating it. It is also the common case to have to do this for the simple reason that - by default - there is no 1:1 relationship between aggregations and post aggregations. (i.e. should a doubleSum() be post-aggregated with a LOG10 or a SQRT ?  should a quantile sketch be post-aggregated with a union/intersect/not ?)
   So all sketches have UDFs that 
   a) generate a sketch 
   e.g. BLOOM_FILTER, TDIGEST_GENERATE_SKETCH, DS_QUANTILES_SKETCH, DS_HLL, DS_THETA
   b) evalutate a sketch
   e.g. BLOOM_FILTER_TEST, TDIGEST_QUANTILE  
   
   wrt **control**, having a sensible default is okay as long as this doesn't take away control from the author in terms of being able to express his intention and having it carried out. If it is my intention to pull unfinalized sketches in order to forklift them from Druid and feed them into another system, then I need to be able to formulate this. Setting a query context is not letting me do this because this is not usually possible on a per-query case and it is also not explicit within the query that other users need to be able to read and understand. I cannot send a query to another user and be sure that it does what I want because some parts of what the query result would look like are formulated outside of the query (namely in the query context). This is a huge usability problem and I'm looking forward to writing a proposal for expressing such query hints within the SQL queries.
   
   Whether sensible defaults trump consistency is a difficult question to answer. Luckily, this tie has already been broken in this context because the set of evaluating UDFs all work both on sketches and on regular fields, so that they act as shortcuts: If all I want is to create a sketch and then to finalize it, then I can directly use the above set of evaluating sketch functions.
   
   Putting all of this together, it seems that one just needs to follow the same concept for the given HLL/THETA sketches, which means that their generative UDFs aught to always return the sketches and the evaluating UDFs should always returned finalized results and can act as a shortcut to short-circuit the generative UDFS.
   
   The DS_HLL and DS_THETA functions just deviate from the common approach taken for the other sketches which also is reflected in their documentation which in my opinion correctly formulates the intention that these generative functions should yield a sketch and not a finalized result.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jon-wei commented on issue #9419: SQL: top-level sketch functions do not return sketches

Posted by GitBox <gi...@apache.org>.
jon-wei commented on issue #9419: SQL: top-level sketch functions do not return sketches
URL: https://github.com/apache/druid/issues/9419#issuecomment-594162644
 
 
   @sascha-coenen 
   
   Re: consistency, SQL currently always "finalizes" the aggregators, the specific finalization behavior comes from the underlying implementation of `AggregatorFactory.finalize()`. The HLL and Theta sketches `finalize()` implementations follow the core aggregators `HyperUniquesAggregatorFactory` and `ApproximateHistogramAggregatorFactory` in not returning the serialized sketch (the latter only received an option to finalize as base64 later on, false by default).
   
   Re: control/expressibility, I agree with you, and the issues with using the query context are noted, but I would need to think more about how that functionality would be best expressed/implemented. Maybe having certain "complex" aggs not be finalized by default in SQL, or an approach that you mentioned here:
   
   > This is a huge usability problem and I'm looking forward to writing a proposal for expressing such query hints within the SQL queries.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sascha-coenen edited a comment on issue #9419: SQL: top-level sketch functions do not return sketches

Posted by GitBox <gi...@apache.org>.
sascha-coenen edited a comment on issue #9419: SQL: top-level sketch functions do not return sketches
URL: https://github.com/apache/druid/issues/9419#issuecomment-592413853
 
 
   > @vikramsinghchandel You can set the same context options in the JDBC connection properties: https://druid.apache.org/docs/latest/querying/sql.html#connection-context
   
   hmm, Vikram already pointed out that he is aware of this. His need is to be able to express such things per query: 
   > Now both of these systems **don't have per query level context settings** **they are set at JDBC driver level.**
   
   --
   
   > the complex aggs like sketches are intended to be "finalized" (give the estimate) unless the user explicitly sets finalize=false in the context, so the bug would be that it's returning the base64 encodings with the additional projections
   
   I respectfully disagree:
   
   There are three dimensions to what makes a good user experience
   a) sensible defaults
   b) consistency
   b) control
   
   wrt **sensible defaults**, I agree with you that the common case is to want to materialize a sketch rather than to look at its raw form and therefore finalizing a sketch would be the most sensible default. 
   
   wrt **consistency** however, for all other sketches the current set of Druid SQL UDFs are designed around separating the responsibilities of creating a sketch and then evaluating it. It is also the common case to have to do this for the simple reason that - by default - there is no 1:1 relationship between aggregations and post aggregations. (i.e. should a doubleSum() be post-aggregated with a LOG10 or a SQRT ?  should a quantile sketch be post-aggregated with a union/intersect/not ?)
   So all sketches have UDFs that 
   a) generate a sketch 
   e.g. BLOOM_FILTER, TDIGEST_GENERATE_SKETCH, DS_QUANTILES_SKETCH, DS_HLL, DS_THETA
   b) evalutate a sketch
   e.g. BLOOM_FILTER_TEST, TDIGEST_QUANTILE  
   
   wrt **control**, having a sensible default is okay as long as this doesn't take away control from the author in terms of being able to express his intention and having it carried out. If it is my intention to pull unfinalized sketches in order to forklift them from Druid and feed them into another system, then I need to be able to formulate this. Setting a query context is not letting me do this because this is not usually possible on a per-query case and it is also not explicit within the query that other users need to be able to read and understand. I cannot send a query to another user and be sure that it does what I want because some parts of what the query result would look like are formulated outside of the query (namely in the query context). This is a huge usability problem and I'm looking forward to writing a proposal for expressing such query hints within the SQL queries.
   
   Whether sensible defaults trump consistency is a difficult question to answer. Luckily, this tie has already been broken in this context because the set of evaluating UDFs all work both on sketches and on regular fields, so that they act as shortcuts: If all I want is to create a sketch and then to finalize it, then I can directly use the above set of evaluating sketch functions.
   
   Putting all of this together, it seems that one just needs to follow the same concept for the given HLL/THETA sketches, which means that their generative UDFs aught to always return the sketches and the evaluating UDFs should always returned finalized results and can act as a shortcut to short-circuit the generative UDFS.
   
   The DS_HLL and DS_THETA functions just deviate from the common approach taken for the other sketches which also is reflected in their documentation which in my opinion correctly formulates the intention that these generative functions should yield a sketch and not a finalized result.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] vikramsinghchandel commented on issue #9419: SQL: top-level sketch functions do not return sketches

Posted by GitBox <gi...@apache.org>.
vikramsinghchandel commented on issue #9419: SQL: top-level sketch functions do not return sketches
URL: https://github.com/apache/druid/issues/9419#issuecomment-591902787
 
 
   Hi @jon-wei 
   Thanks a lot for the solution. But I think the issue here has a much wider scope, As there are systems(viz. Superset/DBeaver) which don't rely on JSON based querying and only have SQL mode.
   
   Now both of these systems (there may be more) don't have per query level context settings they are set at JDBC driver level.
   Or is there a way to set these query context in SQL itself?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jon-wei commented on issue #9419: SQL: top-level sketch functions do not return sketches

Posted by GitBox <gi...@apache.org>.
jon-wei commented on issue #9419: SQL: top-level sketch functions do not return sketches
URL: https://github.com/apache/druid/issues/9419#issuecomment-592180842
 
 
   @vikramsinghchandel You can set the same context options in the JDBC connection properties: https://druid.apache.org/docs/latest/querying/sql.html#connection-context
   
   @sascha-coenen Thanks for the report, the complex aggs like sketches are intended to be "finalized" (give the estimate) unless the user explictly sets finalize=false in the context, so the bug would be that it's returning the base64 encodings with the additional projections

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org