You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by "abhishekrb19 (via GitHub)" <gi...@apache.org> on 2023/05/29 04:22:00 UTC

[GitHub] [druid] abhishekrb19 opened a new issue, #14346: Expose Druid functions via `INFORMATION_SCHEMA.ROUTINES` table.

abhishekrb19 opened a new issue, #14346:
URL: https://github.com/apache/druid/issues/14346

   ### Description
   
   Expose Druid functions and operators programmatically via the Druid SQL interface.
   
   ### Motivation:
   - As a Druid user, I want to be able to programmatically retrieve metadata about registered functions so that I can inspect the system and write automation/tooling that deals with various functions.
   - As a Druid client, I want to be able to retrieve the full list of available functions from a Druid cluster (even from extensions) so I don't have to write code to deal with new functions, types, etc. For example, the Druid web console relies on the [documented functions](https://github.com/apache/druid/blob/f6a0888bc07396a4afab91f5cb576793e26275d2/web-console/script/create-sql-docs.js#L66) to parse them at build time to provide suggestions at runtime. However, this information can be incomplete as new functions can be loaded through extensions at runtime.
   
   
   
   ### Design:
   Add a new table `INFORMATION_SCHEMA.ROUTINES` that exposes SQL functions and operators.  The `INFORMATION_SCHEMA.ROUTINES` table will include the following columns:
   1. ROUTINE_CATALOG: Name of the database. This will always be set to `druid`.
   2. ROUTINE_SCHEMA: Name of the schema. This will always be set to `INFORMATION_SCHEMA`.
   3. ROUTINE_NAME: Name of the function or operator. E.g., `APPROX_COUNT_DISTINCT_DS_THETA`
   4. ROUTINE_TYPE:  Type of routine - `FUNCTION`, `FUNCTION_STAR`, `POSTFIX`, etc.
   5. IS_DETERMINISTIC:  Is the routine deterministic. Returns `YES` for deterministic; `NO` if the routine is nondeterministic.
   6. IS_AGGREGATOR: Is the routine an aggregator or not. Returns `YES` for aggregator functions; `NO` for scalar functions.
   7. SIGNATURES: Possible signatures for the routine as a string.
   
   Note that Druid-specific columns such as `IS_AGGREGATOR`, `SIGNATURES` are also included besides the standard set.
   
   **Example usage:**
   To see information about all the aggregator functions, including ones loaded from extensions, run the following SQL query:
   
   ```sql
   SELECT "ROUTINE_NAME", "IS_DETERMINISTIC", "SIGNATURES"
   FROM "INFORMATION_SCHEMA"."ROUTINES"
   WHERE "ROUTINE_TYPE" = 'FUNCTION' AND 'IS_AGGREGATOR' = 'YES'
   ```
   
   **Other alternatives:** Custom `sys` tables were considered. However, `INFORMATION_SCHEMA.ROUTINES` is the SQL standard to expose stored procedures, routines, and built-in types.
   
   
   ### Implementation sketch:
   Calcite knows about the registered operators, including the ones registered in extensions. So extracting this information from Calcite would be the best way.  Therefore, we can use the `DruidOperatorTable` that implements Calcite's operator table interface. The `DruidOperatorTable` is aware of all the registered operators at runtime, so we can wire this up into the new `INFORMATION_SCHEMA` table.
   
   ### Future work:
   In addition to the above columns, I think we could also add more columns over time, including:
   1. Extension/module name
   2. Return type
   3. Description of the routine, if available.
    
   Getting this information may involve adding more functionality to `DruidOperatorTable`.
   
   
   
   Note that this proposal is only for the Druid functions and operators. Exposing data types would be a separate proposal on its own.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on issue #14346: Expose Druid functions via `INFORMATION_SCHEMA.ROUTINES` table.

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on issue #14346:
URL: https://github.com/apache/druid/issues/14346#issuecomment-1567722514

   Thank you for the proposal, Abhishek. what does a deterministic routine mean? How is the IS_DETERMINISTIC field supposed to be used?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekrb19 commented on issue #14346: Expose Druid functions via `INFORMATION_SCHEMA.ROUTINES` table.

Posted by "abhishekrb19 (via GitHub)" <gi...@apache.org>.
abhishekrb19 commented on issue #14346:
URL: https://github.com/apache/druid/issues/14346#issuecomment-1581876227

   >I'm wondering for ROUTINE_TYPE, what does the FUNCTION_STAR or POSTFIX mean?
   
   @FrankChen021, please see the updated description -- the `ROUTINE_TYPE` column will always be `FUNCTION`. Earlier, I thought we could return `FUNCTION_STAR`, `FUNCTION_ID`, `SPECIAL`, etc. -- information extracted from Calcite's [SqlSyntax](https://calcite.apache.org/javadocAggregate/org/apache/calcite/sql/SqlSyntax.html) for an operator. However, those Calcite enums are not very useful, IMO. So we filter only for these Calcite function syntaxes and always return `FUNCTION` for the `ROUTINE_TYPE` column. Let me know if that makes sense.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekrb19 commented on issue #14346: Expose Druid functions via `INFORMATION_SCHEMA.ROUTINES` table.

Posted by "abhishekrb19 (via GitHub)" <gi...@apache.org>.
abhishekrb19 commented on issue #14346:
URL: https://github.com/apache/druid/issues/14346#issuecomment-1581877825

   > But there's one thing, what's the relationship between the description in the code and the function description in current markdown file? I don't think developers want to write these description in two different places twice.
   So, maybe one way is that, during mvn source generation phase, we can extract the description from the markdown files and generate some string constants that can be referenced from the default implementation of functionDescription().
   
   Good suggestion! I'll think about it more after the main implementation is in place. Also, @vtlim had some thoughts on this.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] zachjsh closed issue #14346: Expose Druid functions via `INFORMATION_SCHEMA.ROUTINES` table.

Posted by "zachjsh (via GitHub)" <gi...@apache.org>.
zachjsh closed issue #14346: Expose Druid functions via `INFORMATION_SCHEMA.ROUTINES` table.
URL: https://github.com/apache/druid/issues/14346


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] vogievetsky commented on issue #14346: Expose Druid functions via `INFORMATION_SCHEMA.ROUTINES` table.

Posted by "vogievetsky (via GitHub)" <gi...@apache.org>.
vogievetsky commented on issue #14346:
URL: https://github.com/apache/druid/issues/14346#issuecomment-1568731403

   Could you link the the documentation that you are basing this on? I tried googling for "INFORMATION_SCHEMA.ROUTINES" and I keep finding docs from different databases that have more or less columns in the report. What is the docs you are following?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekrb19 commented on issue #14346: Expose Druid functions via `INFORMATION_SCHEMA.ROUTINES` table.

Posted by "abhishekrb19 (via GitHub)" <gi...@apache.org>.
abhishekrb19 commented on issue #14346:
URL: https://github.com/apache/druid/issues/14346#issuecomment-1572718755

   As a follow-up to the proposed design implementation, we can add a new column, `ROUTINE_COMMENT`, to `INFORMATION_SCHEMA.ROUTINES` table. The new column allows developers to provide additional information or context about a routine/function. A few vendor implementations that do something similar:
   - MySQL's [info schema routine table](https://dev.mysql.com/doc/refman/5.7/en/information-schema-routines-table.html)
   - MariaDB's [info schema routine table](https://mariadb.com/kb/en/information-schema-routines-table/#:~:text=The%20Information%20Schema%20ROUTINES%20table,stored%20procedures%20and%20stored%20functions.&text=Always%20def%20.&text=Database%20name%20associated%20with%20the%20routine)
   
   `ROUTINE_COMMENT` is optional; some functions may not have a description if it's not defined by a developer. This will be useful for generating documentation so we have a single source of truth and this feature can be used in the Druid web console too.
   
   So with those goals in mind, I think an implementation that will work:
   - Add a default implementation to [SqlAggregator interface](https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/aggregation/SqlAggregator.java) that returns null:
   ```java
     /**
      * @return an optional description about the Sql aggregator.
      */
     @Nullable
     default String functionDescription() {
       return null;
     }
   ```
   - We can override the default implementation for all aggregator functions already in Druid (including the ones defined in open-source extensions). We can reuse the descriptions from the [Druid docs](https://druid.apache.org/docs/latest/querying/sql-functions.html).
    - The default implementation will return null for functions in the wild (from custom extensions or so) that are unaware of `functionDescription()`. It's entirely up to the developer to override the function.
   
   Comments or feedback is appreciated!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekagarwal87 commented on issue #14346: Expose Druid functions via `INFORMATION_SCHEMA.ROUTINES` table.

Posted by "abhishekagarwal87 (via GitHub)" <gi...@apache.org>.
abhishekagarwal87 commented on issue #14346:
URL: https://github.com/apache/druid/issues/14346#issuecomment-1575559177

   @abhishekrb19  - the use case behind the column `IS_DETERMINISTIC` is not very clear to me. We can leave out this column and add it when it's actually necessary. The decision to cache the results is a bit more involved since the underlying data is constantly changing (there is real-time data). 
   
   The `functionDescription` idea is interesting. But then we either commit to it fully and fill in all the implementations of existing functions or we don't do it at all. Because, if we don't fully commit to it, a tool cannot rely on this API to get the description of the function. Are you going to use this function in your client? If you are going to do the work of adding documentation to all the scalar and aggregator functions, then I am +1 for the idea. Otherwise, this enhancement can be done later. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] FrankChen021 commented on issue #14346: Expose Druid functions via `INFORMATION_SCHEMA.ROUTINES` table.

Posted by "FrankChen021 (via GitHub)" <gi...@apache.org>.
FrankChen021 commented on issue #14346:
URL: https://github.com/apache/druid/issues/14346#issuecomment-1579939258

   > As a follow-up to the proposed design implementation, we can add a new column, ROUTINE_COMMENT, to INFORMATION_SCHEMA.ROUTINES table. 
   
   I like the this proposal. 
   
   But there's one thing, what's the relationship between the description in the code and the function description in current markdown file? I don't think developers want to write these description in two different places twice.
   
   So, maybe one way is that, during mvn source generation phase, we can extract the description from the markdown files and generate some string constants  that can be referenced from the default implementation of `functionDescription()`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] FrankChen021 commented on issue #14346: Expose Druid functions via `INFORMATION_SCHEMA.ROUTINES` table.

Posted by "FrankChen021 (via GitHub)" <gi...@apache.org>.
FrankChen021 commented on issue #14346:
URL: https://github.com/apache/druid/issues/14346#issuecomment-1579934676

   This is very useful for client side tools. And the web-console can also benefit from this change to implement a more simplier autocompletion suggestion.
   
   I'm wondering for `ROUTINE_TYPE`, what does the `FUNCTION_STAR` or `POSTFIX` mean?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] vogievetsky commented on issue #14346: Expose Druid functions via `INFORMATION_SCHEMA.ROUTINES` table.

Posted by "vogievetsky (via GitHub)" <gi...@apache.org>.
vogievetsky commented on issue #14346:
URL: https://github.com/apache/druid/issues/14346#issuecomment-1570732139

   Thank you for the link


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekrb19 commented on issue #14346: Expose Druid functions via `INFORMATION_SCHEMA.ROUTINES` table.

Posted by "abhishekrb19 (via GitHub)" <gi...@apache.org>.
abhishekrb19 commented on issue #14346:
URL: https://github.com/apache/druid/issues/14346#issuecomment-1567835147

   @abhishekagarwal87, `IS_DETERMINISTIC` in this table denotes whether the function will always produce the same output for the same set of inputs. It's a standard column from the SQL spec that a lot of vendors expose. For Druid, this information can be determined from Calcite [here](https://calcite.apache.org/javadocAggregate/org/apache/calcite/sql/SqlOperator.html#isDeterministic()).
   
   AFAIK, in Druid, all the functions [documented here](https://druid.apache.org/docs/latest/querying/sql-functions.html) are deterministic by default. Found an old PR (and linked issues) that attempted to add `RAND()`, which would have been non-deterministic, I think, but it wasn't merged. Perhaps someone has implemented a non-deterministic function in the wild in an extension?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekrb19 commented on issue #14346: Expose Druid functions via `INFORMATION_SCHEMA.ROUTINES` table.

Posted by "abhishekrb19 (via GitHub)" <gi...@apache.org>.
abhishekrb19 commented on issue #14346:
URL: https://github.com/apache/druid/issues/14346#issuecomment-1570700301

   @vogievetsky, the proposal is based on the SQL 1999 specification - http://web.cecs.pdx.edu/~len/sql1999.pdf. Specifically for `INFORMATION_SCHEMA.ROUTINES`, please see the relevant sections:
   - Section 20.45 - ROUTINES view
   - A mention in section 4.24 and section 20.69 
   
   The column definitions are sort of scattered throughout the spec. Then I was also looking at a few [vendor implementations](https://en.wikipedia.org/wiki/Information_schema). 
   
   The spec notes several columns that may not apply to Druid. For example, `CREATED` and `LAST_ALTERED` information only applies to stored procedures. So a few follow-up questions:
   1. Should we add most/all columns from the spec to the table and mark them as "unused" (similar to what's [documented](https://druid.apache.org/docs/latest/querying/sql-metadata-tables.html) for the  `INFORMATION_SCHEMA.COLUMNS` and `INFORMATION_SCHEMA.SCHEMATA` tables). Worth noting that the "unused" columns we document is an incomplete list from the spec, so I wonder if they're just there for legacy reasons.
   2. Should we include the specific columns that apply to Druid selectively?
   
   I think 2 makes sense, but let me know what your thoughts are.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] abhishekrb19 commented on issue #14346: Expose Druid functions via `INFORMATION_SCHEMA.ROUTINES` table.

Posted by "abhishekrb19 (via GitHub)" <gi...@apache.org>.
abhishekrb19 commented on issue #14346:
URL: https://github.com/apache/druid/issues/14346#issuecomment-1576085769

   @abhishekagarwal87, thank you for the comments. Yes, removing `IS_DETERMINISTIC` sounds good. As a side note, we currently don't implement the [deterministic function](https://calcite.apache.org/javadocAggregate/org/apache/calcite/sql/SqlOperator.html#isDeterministic()), so it defaults to true for all functions.
   
   As far as `functionDescription` is concerned, my thought is we can fully update the description for _all_ functions in Apache Druid. The fallback null implementation proposed is only for proprietary ones in the wild. I updated the comment with a few applications. Specifically, our client won't need the function description right out of the bat; it's primarily for the Druid web-console and documentation purposes, so I will follow that up after the main work is complete.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org