You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/04/15 10:24:59 UTC

[GitHub] [druid] gianm opened a new pull request #9704: Refresh query docs.

gianm opened a new pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704
 
 
   To aid in review, I've uploaded a render. The pages with the most changes are:
   
   - https://staging-druid.imply.io/docs/latest/querying/datasource.html (new)
   - https://staging-druid.imply.io/docs/latest/querying/query-execution.html (new)
   - https://staging-druid.imply.io/docs/latest/querying/sql.html (refreshed)
   - https://staging-druid.imply.io/docs/latest/querying/joins.html (refreshed)
   
   Larger changes:
   
   - New doc: querying/datasource.md describes the various kinds of
   datasources you can use, and has examples for both SQL and native.
   - New doc: querying/query-execution.md describes how native queries
   are executed at a high level. It doesn't go into the details of specific
   query engines or how queries run at a per-segment level. But I think it
   would be good to add or link that content here in the future.
   - Refreshed doc: querying/sql.md updated to refer to joins, reformatted
   a bit, added a new "Query translation" section that explains how
   queries are translated from SQL to native, and removed configuration
   details (moved to configuration/index.md).
   - Refreshed doc: querying/joins.md updated to refer to join datasources.
   
   Smaller changes:
   
   - Add helpful banners to the top of query documentation pages telling
   people whether a given page describes SQL, native, or both.
   - Add SQL metrics to operations/metrics.md.
   - Add some color and cross-links in various places.
   - Add native query component docs to the sidebar, and renamed them so
   they look nicer.
   - Remove Select query from the sidebar.
   - Fix Broker SQL configs in configuration/index.md. Remove them from
   querying/sql.md.
   - Combined querying/searchquery.md and querying/searchqueryspec.md.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409119791
 
 

 ##########
 File path: docs/querying/joins.md
 ##########
 @@ -22,33 +22,13 @@ title: "Joins"
   ~ under the License.
   -->
 
+Druid has two features related to joining of data:
 
-Apache Druid has limited support for joins through [query-time lookups](../querying/lookups.md). The common use case of
-query-time lookups is to replace one dimension value (e.g. a String ID) with another value (e.g. a human-readable String value). This is similar to a star-schema join.
+1. [Join](datasource.md#join) operators. These are available using a [join datasource](datasource.md#join) in native
+queries, or using the [JOIN operator](sql.md#query-syntax) in Druid SQL. Refer to the
+[join datasource](datasource.md#join) documentation for information about how joins work in Druid.
+2. [Query-time lookups](lookups.md), simple key-to-value mappings. These are preloaded on all servers and can be
+accessed with or without an explicit join operator. Refer to the [lookups](lookups.md) documentation for more details.
 
-Druid does not yet have full support for joins. Although Druid’s storage format would allow for the implementation
-of joins (there is no loss of fidelity for columns included as dimensions), full support for joins have not yet been implemented yet
-for the following reasons:
-
-1. Scaling join queries has been, in our professional experience,
-a constant bottleneck of working with distributed databases.
-2. The incremental gains in functionality are perceived to be
-of less value than the anticipated problems with managing
-highly concurrent, join-heavy workloads.
-
-A join query is essentially the merging of two or more streams of data based on a shared set of keys. The primary
-high-level strategies for join queries we are aware of are a hash-based strategy or a
-sorted-merge strategy. The hash-based strategy requires that all but
-one data set be available as something that looks like a hash table,
-a lookup operation is then performed on this hash table for every
-row in the “primary” stream. The sorted-merge strategy assumes
-that each stream is sorted by the join key and thus allows for the incremental
-joining of the streams. Each of these strategies, however,
-requires the materialization of some number of the streams either in
-sorted order or in a hash table form.
-
-When all sides of the join are significantly large tables (> 1 billion
-records), materializing the pre-join streams requires complex
-distributed memory management. The complexity of the memory
-management is only amplified by the fact that we are targeting highly
-concurrent, multi-tenant workloads.
+Whenever possible, for best performance it is good to avoid joins at query time. Often this can be accomplished by
+joining data before it is loaded into Druid.
 
 Review comment:
   I think those examples are really the same thing. I combined them added another. Here is the result. How does it look?
   
   > Whenever possible, for best performance it is good to avoid joins at query time. Often this can be accomplished by joining data before it is loaded into Druid. However, there are situations where joins or lookups are the best solution available despite the performance overhead, including:
   >
   > - The fact-to-dimension (star and snowflake schema) case: you need to change dimension values after initial ingestion, and aren't able to reingest to do this. In this case, you can use lookups for your dimension tables.
   > - Your workload requires joins or filters on subqueries.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409094878
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
 
 Review comment:
   I replaced it with "all servers that are involved in queries".

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409097989
 
 

 ##########
 File path: docs/querying/post-aggregations.md
 ##########
 @@ -1,6 +1,6 @@
 ---
 id: post-aggregations
-title: "Post-Aggregations"
+title: "Postaggregations"
 
 Review comment:
   How about "Post-aggregations".

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409099699
 
 

 ##########
 File path: docs/querying/query-context.md
 ##########
 @@ -46,26 +55,28 @@ The query context is used for various query configuration parameters. The follow
 |parallelMergeInitialYieldRows|`druid.processing.merge.task.initialYieldNumRows`|Number of rows to yield per ForkJoinPool merge task for parallel result merging on the Broker, before forking off a new task to continue merging sequences. See [Broker configuration](../configuration/index.html#broker) for more details.|
 |parallelMergeSmallBatchRows|`druid.processing.merge.task.smallBatchNumRows`|Size of result batches to operate on in ForkJoinPool merge tasks for parallel result merging on the Broker. See [Broker configuration](../configuration/index.html#broker) for more details.|
 
+## Query-type-specific parameters
 
 Review comment:
    /cc @jon-wei 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sthetland commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
sthetland commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408960612
 
 

 ##########
 File path: docs/querying/joins.md
 ##########
 @@ -22,33 +22,13 @@ title: "Joins"
   ~ under the License.
   -->
 
+Druid has two features related to joining of data:
 
-Apache Druid has limited support for joins through [query-time lookups](../querying/lookups.md). The common use case of
-query-time lookups is to replace one dimension value (e.g. a String ID) with another value (e.g. a human-readable String value). This is similar to a star-schema join.
+1. [Join](datasource.md#join) operators. These are available using a [join datasource](datasource.md#join) in native
+queries, or using the [JOIN operator](sql.md#query-syntax) in Druid SQL. Refer to the
+[join datasource](datasource.md#join) documentation for information about how joins work in Druid.
+2. [Query-time lookups](lookups.md), simple key-to-value mappings. These are preloaded on all servers and can be
+accessed with or without an explicit join operator. Refer to the [lookups](lookups.md) documentation for more details.
 
-Druid does not yet have full support for joins. Although Druid’s storage format would allow for the implementation
-of joins (there is no loss of fidelity for columns included as dimensions), full support for joins have not yet been implemented yet
-for the following reasons:
-
-1. Scaling join queries has been, in our professional experience,
-a constant bottleneck of working with distributed databases.
-2. The incremental gains in functionality are perceived to be
-of less value than the anticipated problems with managing
-highly concurrent, join-heavy workloads.
-
-A join query is essentially the merging of two or more streams of data based on a shared set of keys. The primary
-high-level strategies for join queries we are aware of are a hash-based strategy or a
-sorted-merge strategy. The hash-based strategy requires that all but
-one data set be available as something that looks like a hash table,
-a lookup operation is then performed on this hash table for every
-row in the “primary” stream. The sorted-merge strategy assumes
-that each stream is sorted by the join key and thus allows for the incremental
-joining of the streams. Each of these strategies, however,
-requires the materialization of some number of the streams either in
-sorted order or in a hash table form.
-
-When all sides of the join are significantly large tables (> 1 billion
-records), materializing the pre-join streams requires complex
-distributed memory management. The complexity of the memory
-management is only amplified by the fact that we are targeting highly
-concurrent, multi-tenant workloads.
+Whenever possible, for best performance it is good to avoid joins at query time. Often this can be accomplished by
 
 Review comment:
   Slightly reworded but should it say "avoid" or just "use wisely"?
   
   ```suggestion
   For best performance, we recommend avoiding joins at query time whenever possible. Often this can be accomplished by
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408994814
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
 
 Review comment:
   nit: all servers is vague. Are they pre-loaded on the master nodes too? I think it can be configured so that only the nodes that need lookups have them pre-loaded. My understanding is - brokers (always), historicals (always?), overlord, MM/ Indexer (only if it's needed by ingestion).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408974110
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
 Review comment:
   ```suggestion
   datasource. But at query time, there are many other types of datasources available that are described [below](#datasource-type).
   ```
   
   There's only 1 sub section in this doc right now but that might change in the future. Do you think it's worth a deep link?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409123895
 
 

 ##########
 File path: docs/misc/math-expr.md
 ##########
 @@ -22,9 +22,14 @@ title: "Expressions"
   ~ under the License.
   -->
 
-
-> This feature is still experimental. It has not been optimized for performance yet, and its implementation is known to
->  have significant inefficiencies.
 
 Review comment:
   Should this be noted in the 0.18 Release notes? It looks like these changes were first included in 0.17. Not sure if this is just stale docs, or something we want to call out explicitly. As a Druid user, I just assumed this was not experimental.
   
   cc @jihoonson 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408991406
 
 

 ##########
 File path: website/i18n/en.json
 ##########
 @@ -408,17 +417,21 @@
         "title": "Multi-value dimensions"
       },
       "querying/multitenancy": {
-        "title": "Multitenancy considerations"
+        "title": "Multitenancy considerations",
+        "sidebar_label": "Multitenancy"
       },
       "querying/post-aggregations": {
-        "title": "Post-Aggregations"
+        "title": "Postaggregations"
 
 Review comment:
   What do you think about "Post-aggregations"?
   
   This text will appear in the sidebar, where starting with a lowercase would look weird.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408974884
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
 Review comment:
   this isn't necessarily true? - eg. Join data sources

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408989981
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
 
 Review comment:
   I don't think these intervals match the intervals used by the sql query in the tab above

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409097131
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
 Review comment:
   oops - I thought the "JOIN" statement was on the same level as FROM, WHERE, etc. 🤦 
   
   Maybe a better example is subqueries that are part of a filter
   
   ```
   SELECT table.key, table.metric from table where table.key = (select ANY_VALUE(lookyloo.k) from lookup.lookyloo where lookyloo.v='FILTER')
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409126043
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
 
 Review comment:
   As someone brand new to Druid, I personally prefer an approach that builds on related concepts because a lot of these datasource types are foreign to me. However I also see merits in putting all the SQL usable datasources together.
   
   I think this is a bigger docs architecture discussion, so I'm happy to leave this as is.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409084942
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-	"type": "table",
-	"name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
+something that is not allowed as-is with a native join datasource, Druid SQL will generate a subquery. This can have
+a substantial effect on performance and scalability, so it is something to watch out for. Some examples of when the
+SQL layer will generate subqueries include:
 
-This data source unions two or more table data sources.
+- Joining a regular Druid table to itself, or to another regular Druid table. The native join datasource can accept
+a table on the left-hand side, but not the right, so a subquery is needed.
 
+- Join conditions where the expressions on either side are of different types.
+
+- Join conditions where the right-hand expression is not a direct column access.
+
+For more information about how Druid translates SQL to native queries, refer to the
+[Druid SQL](sql.md#query-translation) documentation.
+
+#### Joins in native queries
+
+Native join datasources have the following properties. All are required.
+
+|Field|Description|
+|-----|-----------|
+|`left`|Left-hand datasource. Must be of type `table`, `join`, `lookup`, `query`, or `inline`. Placing another join as the left datasource allows you to join arbitrarily many datasources.|
+|`right`|Right-hand datasource. Must be of type `lookup`, `query`, or `inline`. Note that this is more rigid than what Druid SQL requires.|
+|`rightPrefix`|String prefix that will be applied to all columns from the right-hand datasource, to prevent them from colliding with columns from the left-hand datasource. Can be any string, so long as it is nonempty and is not be a prefix of the string `__time`. Any columns from the left-hand side that start with your `rightPrefix` will be shadowed. It is up to you to provide a prefix that will not shadow any important columns from the left side.|
+|`condition`|[Expression](../misc/math-expr.md) that must be an equality where one side is an expression of the left-hand side, and the other side is a simple column reference to the right-hand side. Note that this is more rigid than what Druid SQL requires: here, the right-hand reference must be a simple column reference; in SQL it can be an expression.|
+|`joinType`|`INNER` or `LEFT`.|
+
+#### Join performance
+
+Joins are a feature that can significantly affect performance of your queries. Some performance tips and notes:
+
+1. Joins are especially useful with [lookup datasources](#lookup), but in most cases, the
+[`LOOKUP` function](sql.html#string-functions) performs better than a join. Consider using the `LOOKUP` function if
+it is appropriate for your use case.
+2. When using joins in Druid SQL, keep in mind that it can generate subqueries that you did not explicitly include in
+your queries. Refer to the [Druid SQL](sql.md#query-translation) documentation for more details about when this happens
+and how to detect it.
+3. One common reason for implicit subquery generation is if the types of the two halves of an equality do not match.
+For example, since lookup keys are always strings, the condition `druid.d JOIN lookup.l ON d.field = l.field` will
+perform best if `d.field` is a string.
+4. As of Druid {{DRUIDVERSION}}, the join operator must evaluate the condition for each row. In the future, we expect
+to implement both early and deferred condition evaluation, which we expect to improve performance considerably for
+common use cases.
+
+#### Future work for joins
+
+Joins are an area of active development in Druid. The following features are missing today but may appear in
+future versions:
+
+- Preloaded dimension tables that are wider than lookups (i.e. supporting more than a single key and single value).
+- RIGHT OUTER and FULL OUTER joins. Currently, they are partially implemented. Queries will run but results will not
+always be correct.
+- Performance-related optimizations as mentioned in the [previous section](#join-performance).
+- Join algorithms other than broadcast hash-joins.
+
+### `union`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--Native-->
 ```json
 {
-       "type": "union",
-       "dataSources": ["<string_value1>", "<string_value2>", "<string_value3>", ... ]
+  "queryType": "scan",
+  "dataSource": {
+    "type": "union",
+    "dataSources": ["<dataSourceName1>", "<dataSourceName2>", "<dataSourceName3>"]
+  },
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
 
-Note that the data sources being unioned should have the same schema.
-Union Queries should be always sent to a Broker/Router process and are *NOT* supported directly by the Historical processes.
+Union datasources allow you to treat two or more table datasources as a single datasource. The datasources being unioned
+do not need to have identical schemas. If they do not fully match up, then columns that exist in one table but not
+another will be treated as if they contained all null values in the tables where they do not exist.
 
-### Query datasource
+Union datasources are not available in Druid SQL.
 
-This is used for nested groupBys and is only currently supported for groupBys.
+### `inline`
 
 Review comment:
   But they are also a standalone datasource type.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409098493
 
 

 ##########
 File path: docs/querying/query-context.md
 ##########
 @@ -46,26 +55,28 @@ The query context is used for various query configuration parameters. The follow
 |parallelMergeInitialYieldRows|`druid.processing.merge.task.initialYieldNumRows`|Number of rows to yield per ForkJoinPool merge task for parallel result merging on the Broker, before forking off a new task to continue merging sequences. See [Broker configuration](../configuration/index.html#broker) for more details.|
 |parallelMergeSmallBatchRows|`druid.processing.merge.task.smallBatchNumRows`|Size of result batches to operate on in ForkJoinPool merge tasks for parallel result merging on the Broker. See [Broker configuration](../configuration/index.html#broker) for more details.|
 
+## Query-type-specific parameters
 
 Review comment:
   Hmm, I'm not sure. I don't think those parameters are documented right now. Should they be or are they 'secret' internal parameters?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409132718
 
 

 ##########
 File path: docs/misc/math-expr.md
 ##########
 @@ -22,9 +22,14 @@ title: "Expressions"
   ~ under the License.
   -->
 
-
-> This feature is still experimental. It has not been optimized for performance yet, and its implementation is known to
->  have significant inefficiencies.
 
 Review comment:
   It looks worth mentioning in the release notes. Thanks.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409083629
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-	"type": "table",
-	"name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
 
 Review comment:
   Hmm. If you include the list below it (which this paragraph is connected to) it's almost the entirety of the section. I think it would be too big of a warning block. Any suggestions for something smaller that would make sense as a warning?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] ccaominh commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
ccaominh commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409011789
 
 

 ##########
 File path: docs/querying/lookups.md
 ##########
 @@ -55,21 +56,37 @@ Other lookup types are available as extensions, including:
 Query Syntax
 ------------
 
-In [Druid SQL](sql.html), lookups can be queried using the `LOOKUP` function, for example:
+In [Druid SQL](sql.html), lookups can be queried using the [`LOOKUP` function](sql.md#string-functions), for example:
 
+```sql
+SELECT
+  LOOKUP(store, 'store_to_country') AS country,
+  SUM(revenue)
+FROM sales
+GROUP BY 1
 ```
-SELECT LOOKUP(column_name, 'lookup-name'), COUNT(*) FROM datasource GROUP BY 1
+
+They can also be queried using the [JOIN operator](datasource.md#join):
+
+```sql
+SELECT
+  countries.v AS country,
 
 Review comment:
   Should this be `store_to_country.v` instead of `countries.v`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408981030
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-	"type": "table",
-	"name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
+something that is not allowed as-is with a native join datasource, Druid SQL will generate a subquery. This can have
+a substantial effect on performance and scalability, so it is something to watch out for. Some examples of when the
+SQL layer will generate subqueries include:
 
-This data source unions two or more table data sources.
+- Joining a regular Druid table to itself, or to another regular Druid table. The native join datasource can accept
+a table on the left-hand side, but not the right, so a subquery is needed.
 
+- Join conditions where the expressions on either side are of different types.
+
+- Join conditions where the right-hand expression is not a direct column access.
+
+For more information about how Druid translates SQL to native queries, refer to the
+[Druid SQL](sql.md#query-translation) documentation.
+
+#### Joins in native queries
+
+Native join datasources have the following properties. All are required.
+
+|Field|Description|
+|-----|-----------|
+|`left`|Left-hand datasource. Must be of type `table`, `join`, `lookup`, `query`, or `inline`. Placing another join as the left datasource allows you to join arbitrarily many datasources.|
+|`right`|Right-hand datasource. Must be of type `lookup`, `query`, or `inline`. Note that this is more rigid than what Druid SQL requires.|
+|`rightPrefix`|String prefix that will be applied to all columns from the right-hand datasource, to prevent them from colliding with columns from the left-hand datasource. Can be any string, so long as it is nonempty and is not be a prefix of the string `__time`. Any columns from the left-hand side that start with your `rightPrefix` will be shadowed. It is up to you to provide a prefix that will not shadow any important columns from the left side.|
+|`condition`|[Expression](../misc/math-expr.md) that must be an equality where one side is an expression of the left-hand side, and the other side is a simple column reference to the right-hand side. Note that this is more rigid than what Druid SQL requires: here, the right-hand reference must be a simple column reference; in SQL it can be an expression.|
+|`joinType`|`INNER` or `LEFT`.|
+
+#### Join performance
+
+Joins are a feature that can significantly affect performance of your queries. Some performance tips and notes:
+
+1. Joins are especially useful with [lookup datasources](#lookup), but in most cases, the
+[`LOOKUP` function](sql.html#string-functions) performs better than a join. Consider using the `LOOKUP` function if
+it is appropriate for your use case.
+2. When using joins in Druid SQL, keep in mind that it can generate subqueries that you did not explicitly include in
+your queries. Refer to the [Druid SQL](sql.md#query-translation) documentation for more details about when this happens
+and how to detect it.
+3. One common reason for implicit subquery generation is if the types of the two halves of an equality do not match.
+For example, since lookup keys are always strings, the condition `druid.d JOIN lookup.l ON d.field = l.field` will
+perform best if `d.field` is a string.
+4. As of Druid {{DRUIDVERSION}}, the join operator must evaluate the condition for each row. In the future, we expect
+to implement both early and deferred condition evaluation, which we expect to improve performance considerably for
+common use cases.
+
+#### Future work for joins
+
+Joins are an area of active development in Druid. The following features are missing today but may appear in
+future versions:
+
+- Preloaded dimension tables that are wider than lookups (i.e. supporting more than a single key and single value).
+- RIGHT OUTER and FULL OUTER joins. Currently, they are partially implemented. Queries will run but results will not
+always be correct.
+- Performance-related optimizations as mentioned in the [previous section](#join-performance).
+- Join algorithms other than broadcast hash-joins.
+
+### `union`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--Native-->
 ```json
 {
-       "type": "union",
-       "dataSources": ["<string_value1>", "<string_value2>", "<string_value3>", ... ]
+  "queryType": "scan",
+  "dataSource": {
+    "type": "union",
+    "dataSources": ["<dataSourceName1>", "<dataSourceName2>", "<dataSourceName3>"]
+  },
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
 
-Note that the data sources being unioned should have the same schema.
-Union Queries should be always sent to a Broker/Router process and are *NOT* supported directly by the Historical processes.
+Union datasources allow you to treat two or more table datasources as a single datasource. The datasources being unioned
+do not need to have identical schemas. If they do not fully match up, then columns that exist in one table but not
+another will be treated as if they contained all null values in the tables where they do not exist.
 
-### Query datasource
+Union datasources are not available in Druid SQL.
 
-This is used for nested groupBys and is only currently supported for groupBys.
+### `inline`
 
 Review comment:
   Maybe move this under the query datasource section since that's what the broker uses when dealing with subqueries

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408956209
 
 

 ##########
 File path: docs/misc/math-expr.md
 ##########
 @@ -22,9 +22,14 @@ title: "Expressions"
   ~ under the License.
   -->
 
-
-> This feature is still experimental. It has not been optimized for performance yet, and its implementation is known to
->  have significant inefficiencies.
 
 Review comment:
   What changes went in to bring these expressions out of an experimental feature?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409072313
 
 

 ##########
 File path: docs/querying/post-aggregations.md
 ##########
 @@ -1,6 +1,6 @@
 ---
 id: post-aggregations
-title: "Post-Aggregations"
+title: "Postaggregations"
 
 Review comment:
   nit: The title looks strange as one word. Is `postaggregation` a word?
   ```suggestion
   title: "Post Aggregations"
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson merged pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
jihoonson merged pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408956859
 
 

 ##########
 File path: docs/misc/math-expr.md
 ##########
 @@ -22,9 +22,14 @@ title: "Expressions"
   ~ under the License.
   -->
 
-
-> This feature is still experimental. It has not been optimized for performance yet, and its implementation is known to
->  have significant inefficiencies.
 
 Review comment:
   oops sorry hit comment instead of finishing up my review, still going through the rest of these doc changes

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sthetland commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
sthetland commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408930589
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
 
 Review comment:
   ```suggestion
   they reside in the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409024524
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-	"type": "table",
-	"name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
+something that is not allowed as-is with a native join datasource, Druid SQL will generate a subquery. This can have
+a substantial effect on performance and scalability, so it is something to watch out for. Some examples of when the
+SQL layer will generate subqueries include:
 
-This data source unions two or more table data sources.
+- Joining a regular Druid table to itself, or to another regular Druid table. The native join datasource can accept
+a table on the left-hand side, but not the right, so a subquery is needed.
 
+- Join conditions where the expressions on either side are of different types.
+
+- Join conditions where the right-hand expression is not a direct column access.
+
+For more information about how Druid translates SQL to native queries, refer to the
+[Druid SQL](sql.md#query-translation) documentation.
+
+#### Joins in native queries
+
+Native join datasources have the following properties. All are required.
+
+|Field|Description|
+|-----|-----------|
+|`left`|Left-hand datasource. Must be of type `table`, `join`, `lookup`, `query`, or `inline`. Placing another join as the left datasource allows you to join arbitrarily many datasources.|
+|`right`|Right-hand datasource. Must be of type `lookup`, `query`, or `inline`. Note that this is more rigid than what Druid SQL requires.|
+|`rightPrefix`|String prefix that will be applied to all columns from the right-hand datasource, to prevent them from colliding with columns from the left-hand datasource. Can be any string, so long as it is nonempty and is not be a prefix of the string `__time`. Any columns from the left-hand side that start with your `rightPrefix` will be shadowed. It is up to you to provide a prefix that will not shadow any important columns from the left side.|
+|`condition`|[Expression](../misc/math-expr.md) that must be an equality where one side is an expression of the left-hand side, and the other side is a simple column reference to the right-hand side. Note that this is more rigid than what Druid SQL requires: here, the right-hand reference must be a simple column reference; in SQL it can be an expression.|
+|`joinType`|`INNER` or `LEFT`.|
+
+#### Join performance
+
+Joins are a feature that can significantly affect performance of your queries. Some performance tips and notes:
+
+1. Joins are especially useful with [lookup datasources](#lookup), but in most cases, the
+[`LOOKUP` function](sql.html#string-functions) performs better than a join. Consider using the `LOOKUP` function if
+it is appropriate for your use case.
+2. When using joins in Druid SQL, keep in mind that it can generate subqueries that you did not explicitly include in
+your queries. Refer to the [Druid SQL](sql.md#query-translation) documentation for more details about when this happens
+and how to detect it.
+3. One common reason for implicit subquery generation is if the types of the two halves of an equality do not match.
+For example, since lookup keys are always strings, the condition `druid.d JOIN lookup.l ON d.field = l.field` will
+perform best if `d.field` is a string.
+4. As of Druid {{DRUIDVERSION}}, the join operator must evaluate the condition for each row. In the future, we expect
 
 Review comment:
   I think this is an extension of the first point about using the LOOKUP function instead of a join?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409082142
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-	"type": "table",
-	"name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
+something that is not allowed as-is with a native join datasource, Druid SQL will generate a subquery. This can have
+a substantial effect on performance and scalability, so it is something to watch out for. Some examples of when the
+SQL layer will generate subqueries include:
 
-This data source unions two or more table data sources.
+- Joining a regular Druid table to itself, or to another regular Druid table. The native join datasource can accept
+a table on the left-hand side, but not the right, so a subquery is needed.
 
+- Join conditions where the expressions on either side are of different types.
+
+- Join conditions where the right-hand expression is not a direct column access.
+
+For more information about how Druid translates SQL to native queries, refer to the
+[Druid SQL](sql.md#query-translation) documentation.
+
+#### Joins in native queries
+
+Native join datasources have the following properties. All are required.
+
+|Field|Description|
+|-----|-----------|
+|`left`|Left-hand datasource. Must be of type `table`, `join`, `lookup`, `query`, or `inline`. Placing another join as the left datasource allows you to join arbitrarily many datasources.|
+|`right`|Right-hand datasource. Must be of type `lookup`, `query`, or `inline`. Note that this is more rigid than what Druid SQL requires.|
+|`rightPrefix`|String prefix that will be applied to all columns from the right-hand datasource, to prevent them from colliding with columns from the left-hand datasource. Can be any string, so long as it is nonempty and is not be a prefix of the string `__time`. Any columns from the left-hand side that start with your `rightPrefix` will be shadowed. It is up to you to provide a prefix that will not shadow any important columns from the left side.|
+|`condition`|[Expression](../misc/math-expr.md) that must be an equality where one side is an expression of the left-hand side, and the other side is a simple column reference to the right-hand side. Note that this is more rigid than what Druid SQL requires: here, the right-hand reference must be a simple column reference; in SQL it can be an expression.|
+|`joinType`|`INNER` or `LEFT`.|
+
+#### Join performance
+
+Joins are a feature that can significantly affect performance of your queries. Some performance tips and notes:
+
+1. Joins are especially useful with [lookup datasources](#lookup), but in most cases, the
+[`LOOKUP` function](sql.html#string-functions) performs better than a join. Consider using the `LOOKUP` function if
+it is appropriate for your use case.
+2. When using joins in Druid SQL, keep in mind that it can generate subqueries that you did not explicitly include in
+your queries. Refer to the [Druid SQL](sql.md#query-translation) documentation for more details about when this happens
+and how to detect it.
+3. One common reason for implicit subquery generation is if the types of the two halves of an equality do not match.
+For example, since lookup keys are always strings, the condition `druid.d JOIN lookup.l ON d.field = l.field` will
+perform best if `d.field` is a string.
+4. As of Druid {{DRUIDVERSION}}, the join operator must evaluate the condition for each row. In the future, we expect
+to implement both early and deferred condition evaluation, which we expect to improve performance considerably for
+common use cases.
+
+#### Future work for joins
+
+Joins are an area of active development in Druid. The following features are missing today but may appear in
+future versions:
+
+- Preloaded dimension tables that are wider than lookups (i.e. supporting more than a single key and single value).
+- RIGHT OUTER and FULL OUTER joins. Currently, they are partially implemented. Queries will run but results will not
+always be correct.
+- Performance-related optimizations as mentioned in the [previous section](#join-performance).
+- Join algorithms other than broadcast hash-joins.
+
+### `union`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--Native-->
 ```json
 {
-       "type": "union",
-       "dataSources": ["<string_value1>", "<string_value2>", "<string_value3>", ... ]
+  "queryType": "scan",
+  "dataSource": {
+    "type": "union",
+    "dataSources": ["<dataSourceName1>", "<dataSourceName2>", "<dataSourceName3>"]
 
 Review comment:
   Sure, sounds good.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409081444
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-	"type": "table",
-	"name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
+something that is not allowed as-is with a native join datasource, Druid SQL will generate a subquery. This can have
+a substantial effect on performance and scalability, so it is something to watch out for. Some examples of when the
+SQL layer will generate subqueries include:
 
-This data source unions two or more table data sources.
+- Joining a regular Druid table to itself, or to another regular Druid table. The native join datasource can accept
+a table on the left-hand side, but not the right, so a subquery is needed.
 
+- Join conditions where the expressions on either side are of different types.
+
+- Join conditions where the right-hand expression is not a direct column access.
+
+For more information about how Druid translates SQL to native queries, refer to the
+[Druid SQL](sql.md#query-translation) documentation.
+
+#### Joins in native queries
+
+Native join datasources have the following properties. All are required.
+
+|Field|Description|
+|-----|-----------|
+|`left`|Left-hand datasource. Must be of type `table`, `join`, `lookup`, `query`, or `inline`. Placing another join as the left datasource allows you to join arbitrarily many datasources.|
+|`right`|Right-hand datasource. Must be of type `lookup`, `query`, or `inline`. Note that this is more rigid than what Druid SQL requires.|
+|`rightPrefix`|String prefix that will be applied to all columns from the right-hand datasource, to prevent them from colliding with columns from the left-hand datasource. Can be any string, so long as it is nonempty and is not be a prefix of the string `__time`. Any columns from the left-hand side that start with your `rightPrefix` will be shadowed. It is up to you to provide a prefix that will not shadow any important columns from the left side.|
+|`condition`|[Expression](../misc/math-expr.md) that must be an equality where one side is an expression of the left-hand side, and the other side is a simple column reference to the right-hand side. Note that this is more rigid than what Druid SQL requires: here, the right-hand reference must be a simple column reference; in SQL it can be an expression.|
+|`joinType`|`INNER` or `LEFT`.|
+
+#### Join performance
+
+Joins are a feature that can significantly affect performance of your queries. Some performance tips and notes:
+
+1. Joins are especially useful with [lookup datasources](#lookup), but in most cases, the
+[`LOOKUP` function](sql.html#string-functions) performs better than a join. Consider using the `LOOKUP` function if
+it is appropriate for your use case.
+2. When using joins in Druid SQL, keep in mind that it can generate subqueries that you did not explicitly include in
+your queries. Refer to the [Druid SQL](sql.md#query-translation) documentation for more details about when this happens
+and how to detect it.
+3. One common reason for implicit subquery generation is if the types of the two halves of an equality do not match.
+For example, since lookup keys are always strings, the condition `druid.d JOIN lookup.l ON d.field = l.field` will
+perform best if `d.field` is a string.
+4. As of Druid {{DRUIDVERSION}}, the join operator must evaluate the condition for each row. In the future, we expect
+to implement both early and deferred condition evaluation, which we expect to improve performance considerably for
+common use cases.
+
+#### Future work for joins
+
+Joins are an area of active development in Druid. The following features are missing today but may appear in
+future versions:
+
+- Preloaded dimension tables that are wider than lookups (i.e. supporting more than a single key and single value).
+- RIGHT OUTER and FULL OUTER joins. Currently, they are partially implemented. Queries will run but results will not
+always be correct.
+- Performance-related optimizations as mentioned in the [previous section](#join-performance).
+- Join algorithms other than broadcast hash-joins.
+
+### `union`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--Native-->
 ```json
 {
-       "type": "union",
-       "dataSources": ["<string_value1>", "<string_value2>", "<string_value3>", ... ]
+  "queryType": "scan",
+  "dataSource": {
+    "type": "union",
+    "dataSources": ["<dataSourceName1>", "<dataSourceName2>", "<dataSourceName3>"]
+  },
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
 
-Note that the data sources being unioned should have the same schema.
-Union Queries should be always sent to a Broker/Router process and are *NOT* supported directly by the Historical processes.
+Union datasources allow you to treat two or more table datasources as a single datasource. The datasources being unioned
+do not need to have identical schemas. If they do not fully match up, then columns that exist in one table but not
+another will be treated as if they contained all null values in the tables where they do not exist.
 
 Review comment:
   yes, they're discussed in query-execution. I'll add a link.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409065984
 
 

 ##########
 File path: docs/querying/joins.md
 ##########
 @@ -22,33 +22,13 @@ title: "Joins"
   ~ under the License.
   -->
 
+Druid has two features related to joining of data:
 
-Apache Druid has limited support for joins through [query-time lookups](../querying/lookups.md). The common use case of
-query-time lookups is to replace one dimension value (e.g. a String ID) with another value (e.g. a human-readable String value). This is similar to a star-schema join.
+1. [Join](datasource.md#join) operators. These are available using a [join datasource](datasource.md#join) in native
+queries, or using the [JOIN operator](sql.md#query-syntax) in Druid SQL. Refer to the
+[join datasource](datasource.md#join) documentation for information about how joins work in Druid.
+2. [Query-time lookups](lookups.md), simple key-to-value mappings. These are preloaded on all servers and can be
+accessed with or without an explicit join operator. Refer to the [lookups](lookups.md) documentation for more details.
 
-Druid does not yet have full support for joins. Although Druid’s storage format would allow for the implementation
-of joins (there is no loss of fidelity for columns included as dimensions), full support for joins have not yet been implemented yet
-for the following reasons:
-
-1. Scaling join queries has been, in our professional experience,
-a constant bottleneck of working with distributed databases.
-2. The incremental gains in functionality are perceived to be
-of less value than the anticipated problems with managing
-highly concurrent, join-heavy workloads.
-
-A join query is essentially the merging of two or more streams of data based on a shared set of keys. The primary
-high-level strategies for join queries we are aware of are a hash-based strategy or a
-sorted-merge strategy. The hash-based strategy requires that all but
-one data set be available as something that looks like a hash table,
-a lookup operation is then performed on this hash table for every
-row in the “primary” stream. The sorted-merge strategy assumes
-that each stream is sorted by the join key and thus allows for the incremental
-joining of the streams. Each of these strategies, however,
-requires the materialization of some number of the streams either in
-sorted order or in a hash table form.
-
-When all sides of the join are significantly large tables (> 1 billion
-records), materializing the pre-join streams requires complex
-distributed memory management. The complexity of the memory
-management is only amplified by the fact that we are targeting highly
-concurrent, multi-tenant workloads.
+Whenever possible, for best performance it is good to avoid joins at query time. Often this can be accomplished by
+joining data before it is loaded into Druid.
 
 Review comment:
   As a counter-balance to this point, I think we should discuss the motivations for using some form of join functionality with the caveat that performance is slower. eg. Dimensions that change after ingestion, reduce ingestion cost for some performance loss.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sthetland commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
sthetland commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408977569
 
 

 ##########
 File path: website/i18n/en.json
 ##########
 @@ -408,17 +417,21 @@
         "title": "Multi-value dimensions"
       },
       "querying/multitenancy": {
-        "title": "Multitenancy considerations"
+        "title": "Multitenancy considerations",
+        "sidebar_label": "Multitenancy"
       },
       "querying/post-aggregations": {
-        "title": "Post-Aggregations"
+        "title": "Postaggregations"
 
 Review comment:
   is this intentional? In the page text, it's mostly hyphenated, but sometimes two words. Or we could use the code form like groupBy, etc, which would then be "postAggregation".. okay, but not preferred, I'd say 
   
   ```suggestion
           "title": "postAggregation"
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409080924
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
 
 Review comment:
   Yes it should. Thanks.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409133275
 
 

 ##########
 File path: docs/querying/lookups.md
 ##########
 @@ -55,21 +56,37 @@ Other lookup types are available as extensions, including:
 Query Syntax
 ------------
 
-In [Druid SQL](sql.html), lookups can be queried using the `LOOKUP` function, for example:
+In [Druid SQL](sql.html), lookups can be queried using the [`LOOKUP` function](sql.md#string-functions), for example:
 
+```sql
+SELECT
+  LOOKUP(store, 'store_to_country') AS country,
+  SUM(revenue)
+FROM sales
+GROUP BY 1
 ```
-SELECT LOOKUP(column_name, 'lookup-name'), COUNT(*) FROM datasource GROUP BY 1
+
+They can also be queried using the [JOIN operator](datasource.md#join):
 
 Review comment:
   ^ fair point, this documentation makes me think there are only 2 ways to query a lookup, but maybe it's because I'm biased because I've tried in the past and have not been able to query it before that PR :)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409131578
 
 

 ##########
 File path: docs/querying/joins.md
 ##########
 @@ -22,33 +22,13 @@ title: "Joins"
   ~ under the License.
   -->
 
+Druid has two features related to joining of data:
 
-Apache Druid has limited support for joins through [query-time lookups](../querying/lookups.md). The common use case of
-query-time lookups is to replace one dimension value (e.g. a String ID) with another value (e.g. a human-readable String value). This is similar to a star-schema join.
+1. [Join](datasource.md#join) operators. These are available using a [join datasource](datasource.md#join) in native
+queries, or using the [JOIN operator](sql.md#query-syntax) in Druid SQL. Refer to the
+[join datasource](datasource.md#join) documentation for information about how joins work in Druid.
+2. [Query-time lookups](lookups.md), simple key-to-value mappings. These are preloaded on all servers and can be
+accessed with or without an explicit join operator. Refer to the [lookups](lookups.md) documentation for more details.
 
-Druid does not yet have full support for joins. Although Druid’s storage format would allow for the implementation
-of joins (there is no loss of fidelity for columns included as dimensions), full support for joins have not yet been implemented yet
-for the following reasons:
-
-1. Scaling join queries has been, in our professional experience,
-a constant bottleneck of working with distributed databases.
-2. The incremental gains in functionality are perceived to be
-of less value than the anticipated problems with managing
-highly concurrent, join-heavy workloads.
-
-A join query is essentially the merging of two or more streams of data based on a shared set of keys. The primary
-high-level strategies for join queries we are aware of are a hash-based strategy or a
-sorted-merge strategy. The hash-based strategy requires that all but
-one data set be available as something that looks like a hash table,
-a lookup operation is then performed on this hash table for every
-row in the “primary” stream. The sorted-merge strategy assumes
-that each stream is sorted by the join key and thus allows for the incremental
-joining of the streams. Each of these strategies, however,
-requires the materialization of some number of the streams either in
-sorted order or in a hash table form.
-
-When all sides of the join are significantly large tables (> 1 billion
-records), materializing the pre-join streams requires complex
-distributed memory management. The complexity of the memory
-management is only amplified by the fact that we are targeting highly
-concurrent, multi-tenant workloads.
+Whenever possible, for best performance it is good to avoid joins at query time. Often this can be accomplished by
+joining data before it is loaded into Druid.
 
 Review comment:
   looks good. It appears I can't resolve these comments so I've just been emoji-ing the ones I agree with so that I don't spam you with notifications

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
jihoonson commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409078660
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-	"type": "table",
-	"name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
+something that is not allowed as-is with a native join datasource, Druid SQL will generate a subquery. This can have
+a substantial effect on performance and scalability, so it is something to watch out for. Some examples of when the
+SQL layer will generate subqueries include:
 
-This data source unions two or more table data sources.
+- Joining a regular Druid table to itself, or to another regular Druid table. The native join datasource can accept
+a table on the left-hand side, but not the right, so a subquery is needed.
 
+- Join conditions where the expressions on either side are of different types.
+
+- Join conditions where the right-hand expression is not a direct column access.
+
+For more information about how Druid translates SQL to native queries, refer to the
+[Druid SQL](sql.md#query-translation) documentation.
+
+#### Joins in native queries
+
+Native join datasources have the following properties. All are required.
+
+|Field|Description|
+|-----|-----------|
+|`left`|Left-hand datasource. Must be of type `table`, `join`, `lookup`, `query`, or `inline`. Placing another join as the left datasource allows you to join arbitrarily many datasources.|
+|`right`|Right-hand datasource. Must be of type `lookup`, `query`, or `inline`. Note that this is more rigid than what Druid SQL requires.|
+|`rightPrefix`|String prefix that will be applied to all columns from the right-hand datasource, to prevent them from colliding with columns from the left-hand datasource. Can be any string, so long as it is nonempty and is not be a prefix of the string `__time`. Any columns from the left-hand side that start with your `rightPrefix` will be shadowed. It is up to you to provide a prefix that will not shadow any important columns from the left side.|
+|`condition`|[Expression](../misc/math-expr.md) that must be an equality where one side is an expression of the left-hand side, and the other side is a simple column reference to the right-hand side. Note that this is more rigid than what Druid SQL requires: here, the right-hand reference must be a simple column reference; in SQL it can be an expression.|
+|`joinType`|`INNER` or `LEFT`.|
+
+#### Join performance
+
+Joins are a feature that can significantly affect performance of your queries. Some performance tips and notes:
+
+1. Joins are especially useful with [lookup datasources](#lookup), but in most cases, the
+[`LOOKUP` function](sql.html#string-functions) performs better than a join. Consider using the `LOOKUP` function if
+it is appropriate for your use case.
+2. When using joins in Druid SQL, keep in mind that it can generate subqueries that you did not explicitly include in
+your queries. Refer to the [Druid SQL](sql.md#query-translation) documentation for more details about when this happens
+and how to detect it.
+3. One common reason for implicit subquery generation is if the types of the two halves of an equality do not match.
+For example, since lookup keys are always strings, the condition `druid.d JOIN lookup.l ON d.field = l.field` will
+perform best if `d.field` is a string.
+4. As of Druid {{DRUIDVERSION}}, the join operator must evaluate the condition for each row. In the future, we expect
+to implement both early and deferred condition evaluation, which we expect to improve performance considerably for
+common use cases.
+
+#### Future work for joins
+
+Joins are an area of active development in Druid. The following features are missing today but may appear in
+future versions:
+
+- Preloaded dimension tables that are wider than lookups (i.e. supporting more than a single key and single value).
+- RIGHT OUTER and FULL OUTER joins. Currently, they are partially implemented. Queries will run but results will not
+always be correct.
+- Performance-related optimizations as mentioned in the [previous section](#join-performance).
+- Join algorithms other than broadcast hash-joins.
+
+### `union`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--Native-->
 ```json
 {
-       "type": "union",
-       "dataSources": ["<string_value1>", "<string_value2>", "<string_value3>", ... ]
+  "queryType": "scan",
+  "dataSource": {
+    "type": "union",
+    "dataSources": ["<dataSourceName1>", "<dataSourceName2>", "<dataSourceName3>"]
+  },
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
 
-Note that the data sources being unioned should have the same schema.
-Union Queries should be always sent to a Broker/Router process and are *NOT* supported directly by the Historical processes.
+Union datasources allow you to treat two or more table datasources as a single datasource. The datasources being unioned
+do not need to have identical schemas. If they do not fully match up, then columns that exist in one table but not
+another will be treated as if they contained all null values in the tables where they do not exist.
 
 Review comment:
   It is indirectly described [here](https://github.com/apache/druid/pull/9704/files#diff-203b8f56b37fff44bf7ab3b38b94308bR105-R107). 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409088084
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
 
 Review comment:
   Is this better?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408992688
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
 
 Review comment:
   Overall comment: All the SQL queries don't have time filters. As someone new to Druid, I'd look at this page and start issuing SQL commands without time filters. We should point out that this will cause performance issues on anything other than small test data. Don't know what the best way to do that is though

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] jihoonson commented on issue #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
jihoonson commented on issue #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#issuecomment-614324809
 
 
   This PR is a blocker for 0.18.0. Since this PR contains only doc changes and the CI has passed for docs, I'm merging this PR now.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409091670
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
 
 Review comment:
   Hmm I agree this is worded confusingly. I replaced it with the following.
   
   > Performance tip: In most cases, subquery results are fully buffered in memory on the Broker and then further processing occurs on the Broker itself. This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits on the Broker. See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what limits will apply.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409121328
 
 

 ##########
 File path: docs/querying/lookups.md
 ##########
 @@ -55,21 +56,37 @@ Other lookup types are available as extensions, including:
 Query Syntax
 ------------
 
-In [Druid SQL](sql.html), lookups can be queried using the `LOOKUP` function, for example:
+In [Druid SQL](sql.html), lookups can be queried using the [`LOOKUP` function](sql.md#string-functions), for example:
 
+```sql
+SELECT
+  LOOKUP(store, 'store_to_country') AS country,
+  SUM(revenue)
+FROM sales
+GROUP BY 1
 ```
-SELECT LOOKUP(column_name, 'lookup-name'), COUNT(*) FROM datasource GROUP BY 1
+
+They can also be queried using the [JOIN operator](datasource.md#join):
 
 Review comment:
   Do we want to talk about direct querying that was added in #9502

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409022621
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-	"type": "table",
-	"name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
 
 Review comment:
   Maybe make this block a warning block?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409066357
 
 

 ##########
 File path: docs/misc/math-expr.md
 ##########
 @@ -22,9 +22,14 @@ title: "Expressions"
   ~ under the License.
   -->
 
-
-> This feature is still experimental. It has not been optimized for performance yet, and its implementation is known to
->  have significant inefficiencies.
 
 Review comment:
   IIRC, a couple things:
   
   - Improved performance by adding various optimized selectors like SingleStringInputDimensionSelector, SingleLongInputCachingExpressionColumnValueSelector, SingleStringInputCachingExpressionColumnValueSelector.
   - Battle tested by having expressions be the main way of handling pre- and post-aggregation projections in Druid SQL.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408984246
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
 
 Review comment:
   Is this only true if the datasource that is being queried is a table. If it's a lookup or an inline datasource, I think this would be done on the broker, we could still run in to memory issues, and perf is still bad, but we don't pay the transfer cost.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409082933
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-	"type": "table",
-	"name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
+something that is not allowed as-is with a native join datasource, Druid SQL will generate a subquery. This can have
+a substantial effect on performance and scalability, so it is something to watch out for. Some examples of when the
+SQL layer will generate subqueries include:
 
-This data source unions two or more table data sources.
+- Joining a regular Druid table to itself, or to another regular Druid table. The native join datasource can accept
+a table on the left-hand side, but not the right, so a subquery is needed.
 
+- Join conditions where the expressions on either side are of different types.
+
+- Join conditions where the right-hand expression is not a direct column access.
+
+For more information about how Druid translates SQL to native queries, refer to the
+[Druid SQL](sql.md#query-translation) documentation.
+
+#### Joins in native queries
+
+Native join datasources have the following properties. All are required.
+
+|Field|Description|
+|-----|-----------|
+|`left`|Left-hand datasource. Must be of type `table`, `join`, `lookup`, `query`, or `inline`. Placing another join as the left datasource allows you to join arbitrarily many datasources.|
+|`right`|Right-hand datasource. Must be of type `lookup`, `query`, or `inline`. Note that this is more rigid than what Druid SQL requires.|
+|`rightPrefix`|String prefix that will be applied to all columns from the right-hand datasource, to prevent them from colliding with columns from the left-hand datasource. Can be any string, so long as it is nonempty and is not be a prefix of the string `__time`. Any columns from the left-hand side that start with your `rightPrefix` will be shadowed. It is up to you to provide a prefix that will not shadow any important columns from the left side.|
+|`condition`|[Expression](../misc/math-expr.md) that must be an equality where one side is an expression of the left-hand side, and the other side is a simple column reference to the right-hand side. Note that this is more rigid than what Druid SQL requires: here, the right-hand reference must be a simple column reference; in SQL it can be an expression.|
+|`joinType`|`INNER` or `LEFT`.|
+
+#### Join performance
+
+Joins are a feature that can significantly affect performance of your queries. Some performance tips and notes:
+
+1. Joins are especially useful with [lookup datasources](#lookup), but in most cases, the
+[`LOOKUP` function](sql.html#string-functions) performs better than a join. Consider using the `LOOKUP` function if
+it is appropriate for your use case.
+2. When using joins in Druid SQL, keep in mind that it can generate subqueries that you did not explicitly include in
+your queries. Refer to the [Druid SQL](sql.md#query-translation) documentation for more details about when this happens
+and how to detect it.
+3. One common reason for implicit subquery generation is if the types of the two halves of an equality do not match.
+For example, since lookup keys are always strings, the condition `druid.d JOIN lookup.l ON d.field = l.field` will
+perform best if `d.field` is a string.
+4. As of Druid {{DRUIDVERSION}}, the join operator must evaluate the condition for each row. In the future, we expect
+to implement both early and deferred condition evaluation, which we expect to improve performance considerably for
+common use cases.
+
+#### Future work for joins
+
+Joins are an area of active development in Druid. The following features are missing today but may appear in
+future versions:
+
+- Preloaded dimension tables that are wider than lookups (i.e. supporting more than a single key and single value).
+- RIGHT OUTER and FULL OUTER joins. Currently, they are partially implemented. Queries will run but results will not
 
 Review comment:
   There are some tests for the functionality and they do pass. They're in CalciteQueryTest (SQL) and HashJoinSegmentStorageAdapterTest (native). What query did you try to run?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sthetland commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
sthetland commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408953496
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-	"type": "table",
-	"name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
 
 Review comment:
   optional... if it's a bit clearer.
   
   ```suggestion
   Druid SQL is less rigid in handling join datasources than native queries. In cases where a SQL query does
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408978233
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
 
 Review comment:
   I think this should be the last in the list as it can build on top of other concepts described in the different datasource types.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409088084
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
 
 Review comment:
   I'm not sure this is better so I'm inclined to leave it.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409130420
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-	"type": "table",
-	"name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
 
 Review comment:
   Maybe the section on line 242 could be highlighted to drive home the point this paragraph is trying to make. It also has the benefit of linking to the query translation docs that tell a user how to read the query plan to know whether what Druid is trying to do makes sense or not.
   
   ```
   Since Druid SQL is less rigid in handling join datasource than native queries, it is important to understand how Druid translates SQL to native queries, refer to the
   [Druid SQL](sql.md#query-translation) documentation.
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409085526
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
 
 Review comment:
   My rationale here was I wanted to put the ones that were usable in SQL (table, lookup, query, join) at the top. I'm not sure which rationale is more powerful. What do you think?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409124465
 
 

 ##########
 File path: docs/querying/lookups.md
 ##########
 @@ -55,21 +56,37 @@ Other lookup types are available as extensions, including:
 Query Syntax
 ------------
 
-In [Druid SQL](sql.html), lookups can be queried using the `LOOKUP` function, for example:
+In [Druid SQL](sql.html), lookups can be queried using the [`LOOKUP` function](sql.md#string-functions), for example:
 
+```sql
+SELECT
+  LOOKUP(store, 'store_to_country') AS country,
+  SUM(revenue)
+FROM sales
+GROUP BY 1
 ```
-SELECT LOOKUP(column_name, 'lookup-name'), COUNT(*) FROM datasource GROUP BY 1
+
+They can also be queried using the [JOIN operator](datasource.md#join):
 
 Review comment:
   I don't think so, because the assumption should be you can direct query it (why not!)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409072313
 
 

 ##########
 File path: docs/querying/post-aggregations.md
 ##########
 @@ -1,6 +1,6 @@
 ---
 id: post-aggregations
-title: "Post-Aggregations"
+title: "Postaggregations"
 
 Review comment:
   The title looks strange as one word. Is `postaggregation` a word?
   ```suggestion
   title: "Post Aggregations"
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sthetland commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
sthetland commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408997329
 
 

 ##########
 File path: website/i18n/en.json
 ##########
 @@ -408,17 +417,21 @@
         "title": "Multi-value dimensions"
       },
       "querying/multitenancy": {
-        "title": "Multitenancy considerations"
+        "title": "Multitenancy considerations",
+        "sidebar_label": "Multitenancy"
       },
       "querying/post-aggregations": {
-        "title": "Post-Aggregations"
+        "title": "Postaggregations"
 
 Review comment:
   "Post-aggregations" works for me. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sthetland commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
sthetland commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408948194
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-	"type": "table",
-	"name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
+something that is not allowed as-is with a native join datasource, Druid SQL will generate a subquery. This can have
+a substantial effect on performance and scalability, so it is something to watch out for. Some examples of when the
+SQL layer will generate subqueries include:
 
-This data source unions two or more table data sources.
+- Joining a regular Druid table to itself, or to another regular Druid table. The native join datasource can accept
+a table on the left-hand side, but not the right, so a subquery is needed.
 
+- Join conditions where the expressions on either side are of different types.
+
+- Join conditions where the right-hand expression is not a direct column access.
+
+For more information about how Druid translates SQL to native queries, refer to the
+[Druid SQL](sql.md#query-translation) documentation.
+
+#### Joins in native queries
+
+Native join datasources have the following properties. All are required.
+
+|Field|Description|
+|-----|-----------|
+|`left`|Left-hand datasource. Must be of type `table`, `join`, `lookup`, `query`, or `inline`. Placing another join as the left datasource allows you to join arbitrarily many datasources.|
+|`right`|Right-hand datasource. Must be of type `lookup`, `query`, or `inline`. Note that this is more rigid than what Druid SQL requires.|
+|`rightPrefix`|String prefix that will be applied to all columns from the right-hand datasource, to prevent them from colliding with columns from the left-hand datasource. Can be any string, so long as it is nonempty and is not be a prefix of the string `__time`. Any columns from the left-hand side that start with your `rightPrefix` will be shadowed. It is up to you to provide a prefix that will not shadow any important columns from the left side.|
+|`condition`|[Expression](../misc/math-expr.md) that must be an equality where one side is an expression of the left-hand side, and the other side is a simple column reference to the right-hand side. Note that this is more rigid than what Druid SQL requires: here, the right-hand reference must be a simple column reference; in SQL it can be an expression.|
+|`joinType`|`INNER` or `LEFT`.|
+
+#### Join performance
+
+Joins are a feature that can significantly affect performance of your queries. Some performance tips and notes:
+
+1. Joins are especially useful with [lookup datasources](#lookup), but in most cases, the
+[`LOOKUP` function](sql.html#string-functions) performs better than a join. Consider using the `LOOKUP` function if
+it is appropriate for your use case.
+2. When using joins in Druid SQL, keep in mind that it can generate subqueries that you did not explicitly include in
+your queries. Refer to the [Druid SQL](sql.md#query-translation) documentation for more details about when this happens
+and how to detect it.
+3. One common reason for implicit subquery generation is if the types of the two halves of an equality do not match.
+For example, since lookup keys are always strings, the condition `druid.d JOIN lookup.l ON d.field = l.field` will
+perform best if `d.field` is a string.
+4. As of Druid {{DRUIDVERSION}}, the join operator must evaluate the condition for each row. In the future, we expect
+to implement both early and deferred condition evaluation, which we expect to improve performance considerably for
+common use cases.
+
+#### Future work for joins
+
+Joins are an area of active development in Druid. The following features are missing today but may appear in
+future versions:
+
+- Preloaded dimension tables that are wider than lookups (i.e. supporting more than a single key and single value).
+- RIGHT OUTER and FULL OUTER joins. Currently, they are partially implemented. Queries will run but results will not
+always be correct.
 
 Review comment:
   Is it that result "may not always be complete"? ... or any other way to characterize the incorrectness, if possible? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409081206
 
 

 ##########
 File path: docs/querying/lookups.md
 ##########
 @@ -55,21 +56,37 @@ Other lookup types are available as extensions, including:
 Query Syntax
 ------------
 
-In [Druid SQL](sql.html), lookups can be queried using the `LOOKUP` function, for example:
+In [Druid SQL](sql.html), lookups can be queried using the [`LOOKUP` function](sql.md#string-functions), for example:
 
+```sql
+SELECT
+  LOOKUP(store, 'store_to_country') AS country,
+  SUM(revenue)
+FROM sales
+GROUP BY 1
 ```
-SELECT LOOKUP(column_name, 'lookup-name'), COUNT(*) FROM datasource GROUP BY 1
+
+They can also be queried using the [JOIN operator](datasource.md#join):
+
+```sql
+SELECT
+  countries.v AS country,
 
 Review comment:
   Yes it should. Thanks.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sthetland commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
sthetland commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408945687
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-	"type": "table",
-	"name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
+something that is not allowed as-is with a native join datasource, Druid SQL will generate a subquery. This can have
+a substantial effect on performance and scalability, so it is something to watch out for. Some examples of when the
+SQL layer will generate subqueries include:
 
-This data source unions two or more table data sources.
+- Joining a regular Druid table to itself, or to another regular Druid table. The native join datasource can accept
+a table on the left-hand side, but not the right, so a subquery is needed.
 
+- Join conditions where the expressions on either side are of different types.
+
+- Join conditions where the right-hand expression is not a direct column access.
+
+For more information about how Druid translates SQL to native queries, refer to the
+[Druid SQL](sql.md#query-translation) documentation.
+
+#### Joins in native queries
+
+Native join datasources have the following properties. All are required.
+
+|Field|Description|
+|-----|-----------|
+|`left`|Left-hand datasource. Must be of type `table`, `join`, `lookup`, `query`, or `inline`. Placing another join as the left datasource allows you to join arbitrarily many datasources.|
+|`right`|Right-hand datasource. Must be of type `lookup`, `query`, or `inline`. Note that this is more rigid than what Druid SQL requires.|
+|`rightPrefix`|String prefix that will be applied to all columns from the right-hand datasource, to prevent them from colliding with columns from the left-hand datasource. Can be any string, so long as it is nonempty and is not be a prefix of the string `__time`. Any columns from the left-hand side that start with your `rightPrefix` will be shadowed. It is up to you to provide a prefix that will not shadow any important columns from the left side.|
 
 Review comment:
   ```suggestion
   |`rightPrefix`|String prefix that will be applied to all columns from the right-hand datasource to prevent collisions with columns from the left-hand datasource. Can be any string, so long as it is nonempty and is not a prefix of the string `__time`. Any columns from the left-hand side that start with your `rightPrefix` will be shadowed. It is up to you to provide a prefix that will not shadow any important columns from the left side.|
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409087803
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
 Review comment:
   I don't think so, because the thing being linked to is immediately below.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] ccaominh commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
ccaominh commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408977929
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
 
 Review comment:
   Should this be `store_to_country.v` instead of `countries.v`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409133275
 
 

 ##########
 File path: docs/querying/lookups.md
 ##########
 @@ -55,21 +56,37 @@ Other lookup types are available as extensions, including:
 Query Syntax
 ------------
 
-In [Druid SQL](sql.html), lookups can be queried using the `LOOKUP` function, for example:
+In [Druid SQL](sql.html), lookups can be queried using the [`LOOKUP` function](sql.md#string-functions), for example:
 
+```sql
+SELECT
+  LOOKUP(store, 'store_to_country') AS country,
+  SUM(revenue)
+FROM sales
+GROUP BY 1
 ```
-SELECT LOOKUP(column_name, 'lookup-name'), COUNT(*) FROM datasource GROUP BY 1
+
+They can also be queried using the [JOIN operator](datasource.md#join):
 
 Review comment:
   ^ fair point, this documentation makes me think there are only 2 ways to query a lookup, but maybe it's because I'm biased and have not been able to query it before that PR :)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409083162
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-	"type": "table",
-	"name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
+something that is not allowed as-is with a native join datasource, Druid SQL will generate a subquery. This can have
+a substantial effect on performance and scalability, so it is something to watch out for. Some examples of when the
+SQL layer will generate subqueries include:
 
-This data source unions two or more table data sources.
+- Joining a regular Druid table to itself, or to another regular Druid table. The native join datasource can accept
+a table on the left-hand side, but not the right, so a subquery is needed.
 
+- Join conditions where the expressions on either side are of different types.
+
+- Join conditions where the right-hand expression is not a direct column access.
+
+For more information about how Druid translates SQL to native queries, refer to the
+[Druid SQL](sql.md#query-translation) documentation.
+
+#### Joins in native queries
+
+Native join datasources have the following properties. All are required.
+
+|Field|Description|
+|-----|-----------|
+|`left`|Left-hand datasource. Must be of type `table`, `join`, `lookup`, `query`, or `inline`. Placing another join as the left datasource allows you to join arbitrarily many datasources.|
+|`right`|Right-hand datasource. Must be of type `lookup`, `query`, or `inline`. Note that this is more rigid than what Druid SQL requires.|
+|`rightPrefix`|String prefix that will be applied to all columns from the right-hand datasource, to prevent them from colliding with columns from the left-hand datasource. Can be any string, so long as it is nonempty and is not be a prefix of the string `__time`. Any columns from the left-hand side that start with your `rightPrefix` will be shadowed. It is up to you to provide a prefix that will not shadow any important columns from the left side.|
+|`condition`|[Expression](../misc/math-expr.md) that must be an equality where one side is an expression of the left-hand side, and the other side is a simple column reference to the right-hand side. Note that this is more rigid than what Druid SQL requires: here, the right-hand reference must be a simple column reference; in SQL it can be an expression.|
+|`joinType`|`INNER` or `LEFT`.|
+
+#### Join performance
+
+Joins are a feature that can significantly affect performance of your queries. Some performance tips and notes:
+
+1. Joins are especially useful with [lookup datasources](#lookup), but in most cases, the
+[`LOOKUP` function](sql.html#string-functions) performs better than a join. Consider using the `LOOKUP` function if
+it is appropriate for your use case.
+2. When using joins in Druid SQL, keep in mind that it can generate subqueries that you did not explicitly include in
+your queries. Refer to the [Druid SQL](sql.md#query-translation) documentation for more details about when this happens
+and how to detect it.
+3. One common reason for implicit subquery generation is if the types of the two halves of an equality do not match.
+For example, since lookup keys are always strings, the condition `druid.d JOIN lookup.l ON d.field = l.field` will
+perform best if `d.field` is a string.
+4. As of Druid {{DRUIDVERSION}}, the join operator must evaluate the condition for each row. In the future, we expect
 
 Review comment:
   It is. I just wanted to say it twice so people get the point.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409093828
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
 
 Review comment:
   They do if `site_traffic` dataset doesn't have any data outside this interval 🙂
   
   My thinking here was that we're not holding these out as the queries that Druid SQL would generate based on these native queries. So I took the opportunity to make them look nicer. In addition to using prettier intervals I also used prettier column names.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409037770
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-	"type": "table",
-	"name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
+something that is not allowed as-is with a native join datasource, Druid SQL will generate a subquery. This can have
+a substantial effect on performance and scalability, so it is something to watch out for. Some examples of when the
+SQL layer will generate subqueries include:
 
-This data source unions two or more table data sources.
+- Joining a regular Druid table to itself, or to another regular Druid table. The native join datasource can accept
+a table on the left-hand side, but not the right, so a subquery is needed.
 
+- Join conditions where the expressions on either side are of different types.
+
+- Join conditions where the right-hand expression is not a direct column access.
+
+For more information about how Druid translates SQL to native queries, refer to the
+[Druid SQL](sql.md#query-translation) documentation.
+
+#### Joins in native queries
+
+Native join datasources have the following properties. All are required.
+
+|Field|Description|
+|-----|-----------|
+|`left`|Left-hand datasource. Must be of type `table`, `join`, `lookup`, `query`, or `inline`. Placing another join as the left datasource allows you to join arbitrarily many datasources.|
+|`right`|Right-hand datasource. Must be of type `lookup`, `query`, or `inline`. Note that this is more rigid than what Druid SQL requires.|
+|`rightPrefix`|String prefix that will be applied to all columns from the right-hand datasource, to prevent them from colliding with columns from the left-hand datasource. Can be any string, so long as it is nonempty and is not be a prefix of the string `__time`. Any columns from the left-hand side that start with your `rightPrefix` will be shadowed. It is up to you to provide a prefix that will not shadow any important columns from the left side.|
+|`condition`|[Expression](../misc/math-expr.md) that must be an equality where one side is an expression of the left-hand side, and the other side is a simple column reference to the right-hand side. Note that this is more rigid than what Druid SQL requires: here, the right-hand reference must be a simple column reference; in SQL it can be an expression.|
+|`joinType`|`INNER` or `LEFT`.|
+
+#### Join performance
+
+Joins are a feature that can significantly affect performance of your queries. Some performance tips and notes:
+
+1. Joins are especially useful with [lookup datasources](#lookup), but in most cases, the
+[`LOOKUP` function](sql.html#string-functions) performs better than a join. Consider using the `LOOKUP` function if
+it is appropriate for your use case.
+2. When using joins in Druid SQL, keep in mind that it can generate subqueries that you did not explicitly include in
+your queries. Refer to the [Druid SQL](sql.md#query-translation) documentation for more details about when this happens
+and how to detect it.
+3. One common reason for implicit subquery generation is if the types of the two halves of an equality do not match.
+For example, since lookup keys are always strings, the condition `druid.d JOIN lookup.l ON d.field = l.field` will
+perform best if `d.field` is a string.
+4. As of Druid {{DRUIDVERSION}}, the join operator must evaluate the condition for each row. In the future, we expect
+to implement both early and deferred condition evaluation, which we expect to improve performance considerably for
+common use cases.
+
+#### Future work for joins
+
+Joins are an area of active development in Druid. The following features are missing today but may appear in
+future versions:
+
+- Preloaded dimension tables that are wider than lookups (i.e. supporting more than a single key and single value).
+- RIGHT OUTER and FULL OUTER joins. Currently, they are partially implemented. Queries will run but results will not
+always be correct.
+- Performance-related optimizations as mentioned in the [previous section](#join-performance).
+- Join algorithms other than broadcast hash-joins.
+
+### `union`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--Native-->
 ```json
 {
-       "type": "union",
-       "dataSources": ["<string_value1>", "<string_value2>", "<string_value3>", ... ]
+  "queryType": "scan",
+  "dataSource": {
+    "type": "union",
+    "dataSources": ["<dataSourceName1>", "<dataSourceName2>", "<dataSourceName3>"]
+  },
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
 
-Note that the data sources being unioned should have the same schema.
-Union Queries should be always sent to a Broker/Router process and are *NOT* supported directly by the Historical processes.
+Union datasources allow you to treat two or more table datasources as a single datasource. The datasources being unioned
+do not need to have identical schemas. If they do not fully match up, then columns that exist in one table but not
+another will be treated as if they contained all null values in the tables where they do not exist.
 
 Review comment:
   Are there any performance notable implications of using union datasources?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409035352
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-	"type": "table",
-	"name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
+something that is not allowed as-is with a native join datasource, Druid SQL will generate a subquery. This can have
+a substantial effect on performance and scalability, so it is something to watch out for. Some examples of when the
+SQL layer will generate subqueries include:
 
-This data source unions two or more table data sources.
+- Joining a regular Druid table to itself, or to another regular Druid table. The native join datasource can accept
+a table on the left-hand side, but not the right, so a subquery is needed.
 
+- Join conditions where the expressions on either side are of different types.
+
+- Join conditions where the right-hand expression is not a direct column access.
+
+For more information about how Druid translates SQL to native queries, refer to the
+[Druid SQL](sql.md#query-translation) documentation.
+
+#### Joins in native queries
+
+Native join datasources have the following properties. All are required.
+
+|Field|Description|
+|-----|-----------|
+|`left`|Left-hand datasource. Must be of type `table`, `join`, `lookup`, `query`, or `inline`. Placing another join as the left datasource allows you to join arbitrarily many datasources.|
+|`right`|Right-hand datasource. Must be of type `lookup`, `query`, or `inline`. Note that this is more rigid than what Druid SQL requires.|
+|`rightPrefix`|String prefix that will be applied to all columns from the right-hand datasource, to prevent them from colliding with columns from the left-hand datasource. Can be any string, so long as it is nonempty and is not be a prefix of the string `__time`. Any columns from the left-hand side that start with your `rightPrefix` will be shadowed. It is up to you to provide a prefix that will not shadow any important columns from the left side.|
+|`condition`|[Expression](../misc/math-expr.md) that must be an equality where one side is an expression of the left-hand side, and the other side is a simple column reference to the right-hand side. Note that this is more rigid than what Druid SQL requires: here, the right-hand reference must be a simple column reference; in SQL it can be an expression.|
+|`joinType`|`INNER` or `LEFT`.|
+
+#### Join performance
+
+Joins are a feature that can significantly affect performance of your queries. Some performance tips and notes:
+
+1. Joins are especially useful with [lookup datasources](#lookup), but in most cases, the
+[`LOOKUP` function](sql.html#string-functions) performs better than a join. Consider using the `LOOKUP` function if
+it is appropriate for your use case.
+2. When using joins in Druid SQL, keep in mind that it can generate subqueries that you did not explicitly include in
+your queries. Refer to the [Druid SQL](sql.md#query-translation) documentation for more details about when this happens
+and how to detect it.
+3. One common reason for implicit subquery generation is if the types of the two halves of an equality do not match.
+For example, since lookup keys are always strings, the condition `druid.d JOIN lookup.l ON d.field = l.field` will
+perform best if `d.field` is a string.
+4. As of Druid {{DRUIDVERSION}}, the join operator must evaluate the condition for each row. In the future, we expect
+to implement both early and deferred condition evaluation, which we expect to improve performance considerably for
+common use cases.
+
+#### Future work for joins
+
+Joins are an area of active development in Druid. The following features are missing today but may appear in
+future versions:
+
+- Preloaded dimension tables that are wider than lookups (i.e. supporting more than a single key and single value).
+- RIGHT OUTER and FULL OUTER joins. Currently, they are partially implemented. Queries will run but results will not
+always be correct.
+- Performance-related optimizations as mentioned in the [previous section](#join-performance).
+- Join algorithms other than broadcast hash-joins.
+
+### `union`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--Native-->
 ```json
 {
-       "type": "union",
-       "dataSources": ["<string_value1>", "<string_value2>", "<string_value3>", ... ]
+  "queryType": "scan",
+  "dataSource": {
+    "type": "union",
+    "dataSources": ["<dataSourceName1>", "<dataSourceName2>", "<dataSourceName3>"]
 
 Review comment:
   nit:
   ```suggestion
       "dataSources": ["<tableDataSourceName1>", "<tableDataSourceName2>", "<tableDataSourceName3>"]
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
gianm commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409087255
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
 Review comment:
   Joins are still something you provide in the FROM clause. Right?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408973395
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
 
 Review comment:
   nit: singular?
   ```suggestion
   A datasource in Apache Druid is a thing that you can query. The most common kind of datasource is a table datasource,
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sthetland commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
sthetland commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r408929947
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
 
 Review comment:
   ```suggestion
   In [Druid SQL](sql.html#from), table datasources reside in the `druid` schema. This is the default schema, so table
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409034804
 
 

 ##########
 File path: docs/querying/datasource.md
 ##########
 @@ -22,43 +22,317 @@ title: "Datasources"
   ~ under the License.
   -->
 
+Datasources in Apache Druid are things that you can query. The most common kind of datasource is a table datasource,
+and in many contexts the word "datasource" implicitly refers to table datasources. This is especially true
+[during data ingestion](../ingestion/index.html), where ingestion is always creating or writing into a table
+datasource. But at query time, there are many other types of datasources available.
 
-A data source is the Apache Druid equivalent of a database table. However, a query can also masquerade as a data source, providing subquery-like functionality. Query data sources are currently supported only by [GroupBy](../querying/groupbyquery.md) queries.
+In the [Druid SQL](sql.html) language, datasources are provided in the [`FROM` clause](sql.html#from).
 
-### Table datasource
-The table data source is the most common type. It's represented by a string, or by the full structure:
+The word "datasource" is generally spelled `dataSource` (with a capital S) when it appears in API requests and
+responses.
 
+## Datasource type
+
+### `table`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT column1, column2 FROM "druid"."dataSourceName"
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": "dataSourceName",
+  "columns": ["column1", "column2"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+The table datasource is the most common type. This is the kind of datasource you get when you perform
+[data ingestion](../ingestion/index.html). They are split up into segments, distributed around the cluster,
+and queried in parallel.
+
+In [Druid SQL](sql.html#from), table datasources reside in the the `druid` schema. This is the default schema, so table
+datasources can be referenced as either `druid.dataSourceName` or simply `dataSourceName`.
+
+In native queries, table datasources can be referenced using their names as strings (as in the example above), or by
+using JSON objects of the form:
+
+```json
+"dataSource": {
+  "type": "table",
+  "name": "dataSourceName"
+}
+```
+
+To see a list of all table datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'druid'`.
+
+### `lookup`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+SELECT k, v FROM lookup.countries
+```
+<!--Native-->
+```json
+{
+  "queryType": "scan",
+  "dataSource": {
+    "type": "lookup",
+    "lookup": "countries"
+  },
+  "columns": ["k", "v"],
+  "intervals": ["0000/3000"]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Lookup datasources correspond to Druid's key-value [lookup](lookups.html) objects. In [Druid SQL](sql.html#from),
+they reside in the the `lookup` schema. They are preloaded in memory on all servers, so they can be accessed rapidly.
+They can be joined onto regular tables using the [join operator](#join).
+
+Lookup datasources are key-value oriented and always have exactly two columns: `k` (the key) and `v` (the value), and
+both are always strings.
+
+To see a list of all lookup datasources, use the SQL query
+`SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'lookup'`.
+
+> Performance tip: Lookups can be joined with a base table either using an explicit [join](#join), or by using the
+> SQL [`LOOKUP` function](sql.html#string-functions).
+> However, the join operator must evaluate the condition on each row, whereas the
+> `LOOKUP` function can defer evaluation until after an aggregation phase. This means that the `LOOKUP` function is
+> usually faster than joining to a lookup datasource.
+
+### `query`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Uses a subquery to count hits per page, then takes the average.
+SELECT
+  AVG(cnt) AS average_hits_per_page
+FROM
+  (SELECT page, COUNT(*) AS hits FROM site_traffic GROUP BY page)
+```
+<!--Native-->
+```json
+{
+  "queryType": "timeseries",
+  "dataSource": {
+    "type": "query",
+    "query": {
+      "queryType": "groupBy",
+      "dataSource": "site_traffic",
+      "intervals": ["0000/3000"],
+      "granularity": "all",
+      "dimensions": ["page"],
+      "aggregations": [
+        { "type": "count", "name": "hits" }
+      ]
+    }
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "aggregations": [
+    { "type": "longSum", "name": "hits", "fieldName": "hits" },
+    { "type": "count", "name": "pages" }
+  ],
+  "postAggregations": [
+    { "type": "expression", "name": "average_hits_per_page", "expression": "hits / pages" }
+  ]
+}
+```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Query datasources allow you to issue subqueries. In native queries, they can appear anywhere that accepts a
+`dataSource`. In SQL, they can appear in the following places, always surrounded by parentheses:
+
+- The FROM clause: `FROM (<subquery>)`.
+- As inputs to a JOIN: `<table-or-subquery-1> t1 INNER JOIN <table-or-subquery-2> t2 ON t1.<col1> = t2.<col2>`.
+- In the WHERE clause: `WHERE <column> { IN | NOT IN } (<subquery>)`. These are translated to joins by the SQL planner.
+
+> Performance tip: Subquery results need to be fully transferred to the Broker as part of query execution.
+> This means that subqueries with large result sets can cause performance bottlenecks or run into memory usage limits.
+> See the [Query execution](query-execution.md) documentation for more details on how subqueries are executed and what
+> limits will apply.
+
+### `join`
+
+<!--DOCUSAURUS_CODE_TABS-->
+<!--SQL-->
+```sql
+-- Joins "sales" with "countries" (using "store" as the join key) to get sales by country.
+SELECT
+  countries.v AS country,
+  SUM(sales.revenue) AS country_revenue
+FROM
+  sales
+  INNER JOIN lookup.store_to_country ON sales.store = store_to_country.k
+GROUP BY
+  countries.v
+```
+<!--Native-->
 ```json
 {
-	"type": "table",
-	"name": "<string_value>"
+  "queryType": "groupBy",
+  "dataSource": {
+    "type": "join",
+    "left": "sales",
+    "right": {
+      "type": "lookup",
+      "lookup": "store_to_country"
+    },
+    "rightPrefix": "r.",
+    "condition": "store == \"r.k\"",
+    "joinType": "INNER"
+  },
+  "intervals": ["0000/3000"],
+  "granularity": "all",
+  "dimensions": [
+    { "type": "default", "outputName": "country", "dimension": "r.v" }
+  ],
+  "aggregations": [
+    { "type": "longSum", "name": "country_revenue", "fieldName": "revenue" }
+  ]
 }
 ```
+<!--END_DOCUSAURUS_CODE_TABS-->
+
+Join datasources allow you to do a SQL-style join of two datasources. Stacking joins on top of each other allows
+you to join arbitrarily many datasources.
+
+In Druid {{DRUIDVERSION}}, joins are implemented with a broadcast hash-join algorithm. This means that all tables
+other than the leftmost "base" table must fit in memory. It also means that the join condition must be an equality. This
+feature is intended mainly to allow joining regular Druid tables with [lookup](#lookup), [inline](#inline), and
+[query](#query) datasources.
+
+For information about how Druid executes queries involving joins, refer to the
+[Query execution](query-execution.html#join) page.
+
+#### Joins in SQL
+
+SQL joins take the form:
+
+```
+<o1> [ INNER | LEFT [OUTER] ] JOIN <o2> ON <condition>
+```
+
+The condition must involve only equalities, but functions are okay, and there can be multiple equalities ANDed together.
+Conditions like `t1.x = t2.x`, or `LOWER(t1.x) = t2.x`, or `t1.x = t2.x AND t1.y = t2.y` can all be handled. Conditions
+like `t1.x <> t2.x` cannot currently be handled.
 
-### Union datasource
+Note that Druid SQL is less rigid than what native join datasources can handle. In cases where a SQL query does
+something that is not allowed as-is with a native join datasource, Druid SQL will generate a subquery. This can have
+a substantial effect on performance and scalability, so it is something to watch out for. Some examples of when the
+SQL layer will generate subqueries include:
 
-This data source unions two or more table data sources.
+- Joining a regular Druid table to itself, or to another regular Druid table. The native join datasource can accept
+a table on the left-hand side, but not the right, so a subquery is needed.
 
+- Join conditions where the expressions on either side are of different types.
+
+- Join conditions where the right-hand expression is not a direct column access.
+
+For more information about how Druid translates SQL to native queries, refer to the
+[Druid SQL](sql.md#query-translation) documentation.
+
+#### Joins in native queries
+
+Native join datasources have the following properties. All are required.
+
+|Field|Description|
+|-----|-----------|
+|`left`|Left-hand datasource. Must be of type `table`, `join`, `lookup`, `query`, or `inline`. Placing another join as the left datasource allows you to join arbitrarily many datasources.|
+|`right`|Right-hand datasource. Must be of type `lookup`, `query`, or `inline`. Note that this is more rigid than what Druid SQL requires.|
+|`rightPrefix`|String prefix that will be applied to all columns from the right-hand datasource, to prevent them from colliding with columns from the left-hand datasource. Can be any string, so long as it is nonempty and is not be a prefix of the string `__time`. Any columns from the left-hand side that start with your `rightPrefix` will be shadowed. It is up to you to provide a prefix that will not shadow any important columns from the left side.|
+|`condition`|[Expression](../misc/math-expr.md) that must be an equality where one side is an expression of the left-hand side, and the other side is a simple column reference to the right-hand side. Note that this is more rigid than what Druid SQL requires: here, the right-hand reference must be a simple column reference; in SQL it can be an expression.|
+|`joinType`|`INNER` or `LEFT`.|
+
+#### Join performance
+
+Joins are a feature that can significantly affect performance of your queries. Some performance tips and notes:
+
+1. Joins are especially useful with [lookup datasources](#lookup), but in most cases, the
+[`LOOKUP` function](sql.html#string-functions) performs better than a join. Consider using the `LOOKUP` function if
+it is appropriate for your use case.
+2. When using joins in Druid SQL, keep in mind that it can generate subqueries that you did not explicitly include in
+your queries. Refer to the [Druid SQL](sql.md#query-translation) documentation for more details about when this happens
+and how to detect it.
+3. One common reason for implicit subquery generation is if the types of the two halves of an equality do not match.
+For example, since lookup keys are always strings, the condition `druid.d JOIN lookup.l ON d.field = l.field` will
+perform best if `d.field` is a string.
+4. As of Druid {{DRUIDVERSION}}, the join operator must evaluate the condition for each row. In the future, we expect
+to implement both early and deferred condition evaluation, which we expect to improve performance considerably for
+common use cases.
+
+#### Future work for joins
+
+Joins are an area of active development in Druid. The following features are missing today but may appear in
+future versions:
+
+- Preloaded dimension tables that are wider than lookups (i.e. supporting more than a single key and single value).
+- RIGHT OUTER and FULL OUTER joins. Currently, they are partially implemented. Queries will run but results will not
 
 Review comment:
   Is it possible to issue these queries today? I tried doing it from SQL and the query wouldn't plan. IIRC native queries also do not allow these conditions.
   
   Maybe something like 
   CROSS joins are supported via sql, but it does so using sub-queries, so it really should not be used in production use cases

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on a change in pull request #9704: Refresh query docs.

Posted by GitBox <gi...@apache.org>.
suneet-s commented on a change in pull request #9704: Refresh query docs.
URL: https://github.com/apache/druid/pull/9704#discussion_r409077692
 
 

 ##########
 File path: docs/querying/query-context.md
 ##########
 @@ -46,26 +55,28 @@ The query context is used for various query configuration parameters. The follow
 |parallelMergeInitialYieldRows|`druid.processing.merge.task.initialYieldNumRows`|Number of rows to yield per ForkJoinPool merge task for parallel result merging on the Broker, before forking off a new task to continue merging sequences. See [Broker configuration](../configuration/index.html#broker) for more details.|
 |parallelMergeSmallBatchRows|`druid.processing.merge.task.smallBatchNumRows`|Size of result batches to operate on in ForkJoinPool merge tasks for parallel result merging on the Broker. See [Broker configuration](../configuration/index.html#broker) for more details.|
 
+## Query-type-specific parameters
 
 Review comment:
   Should there be a section for datsource specific parameters? To talk about query contexts for join queries (filer pushdown ,etc.) or is that covered in another section

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org