You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/08/16 19:07:35 UTC

[GitHub] [druid] SharuBob opened a new issue #11603: Druid SQL: Query with Count(*) is not returning actual number of rows with HAVING clause on derived columns

SharuBob opened a new issue #11603:
URL: https://github.com/apache/druid/issues/11603


   Druid SQL: Query with Count(*) is not returning actual number of rows with HAVING clause on derived columns
   
   ### Affected Version
   0.21.0
   
   ### Description
   
   **We are using Druid for Ads Analytics and reporting. Storing all the advertisement dimensions and metrics, such as advertiserID, campaignID, clicks, impressions etc.**
   
   _I'm using Druid SQL for Querying not the native Query:_
   The use case is for supporting pagination, we need total count of the rows returned so can use DRUID SQL LIMIT and OFFSET for pagination.
   
   In order to get the total count for the given query we are using `SELECT COUNT(*) FROM (SUBQUERY) SOURCE`
   
   **Issue: The total count returned is not the actual number of rows for the actual query.**
   
   **### Example:**
   
   The below Query would return total 6 rows, 
   
   `SELECT advertiser_id, campaign_id, city_code, matched_keywords, DATE_TRUNC('year', __time) AS Year_, 
   SUM(impression) AS impression, SUM(click) AS click, (CAST(SUM(click) as FLOAT)/CAST(SUM(impression) as FLOAT))*100 AS CTR_, 
   COUNT(*) AS counts FROM ad_fact_v1 WHERE __time >= '2018-09-01T18:05:39Z' AND __time < '2021-08-16T18:05:39Z' AND 
   advertiser_id IN ('290641618220892166') AND campaign_id IN ('347205814932489837') GROUP BY 1, 2, 3, 4, 5 HAVING click >= 100 AND CTR_ <= 4.90 ORDER BY CTR_ ASC`
   
   Equivalent native query using explain:
   
   `{
   	"queryType": "groupBy",
   	"dataSource": {
   		"type": "table",
   		"name": "ad_fact_v1"
   	},
   	"intervals": {
   		"type": "intervals",
   		"intervals": [
   			"2018-09-01T18:05:39.000Z/2021-08-16T18:05:39.000Z"
   		]
   	},
   	"virtualColumns": [{
   		"type": "expression",
   		"name": "v2",
   		"expression": "timestamp_floor(\"__time\",'P1Y',null,'UTC')",
   		"outputType": "LONG"
   	}],
   	"filter": {
   		"type": "and",
   		"fields": [{
   				"type": "selector",
   				"dimension": "advertiser_id",
   				"value": "290641618220892166",
   				"extractionFn": null
   			},
   			{
   				"type": "selector",
   				"dimension": "campaign_id",
   				"value": "347205814932489837",
   				"extractionFn": null
   			}
   		]
   	},
   	"granularity": {
   		"type": "all"
   	},
   	"dimensions": [{
   			"type": "default",
   			"dimension": "city_code",
   			"outputName": "d2",
   			"outputType": "STRING"
   		},
   		{
   			"type": "default",
   			"dimension": "matched_keywords",
   			"outputName": "d3",
   			"outputType": "STRING"
   		},
   		{
   			"type": "default",
   			"dimension": "v2",
   			"outputName": "d4",
   			"outputType": "LONG"
   		}
   	],
   	"aggregations": [{
   			"type": "longSum",
   			"name": "a0",
   			"fieldName": "impression",
   			"expression": null
   		},
   		{
   			"type": "longSum",
   			"name": "a1",
   			"fieldName": "click",
   			"expression": null
   		},
   		{
   			"type": "count",
   			"name": "a2"
   		}
   	],
   	"postAggregations": [{
   			"type": "expression",
   			"name": "p0",
   			"expression": "'290641618220892166'",
   			"ordering": null
   		},
   		{
   			"type": "expression",
   			"name": "p1",
   			"expression": "'347205814932489837'",
   			"ordering": null
   		},
   		{
   			"type": "expression",
   			"name": "p2",
   			"expression": "((CAST(\"a1\", 'DOUBLE') / CAST(\"a0\", 'DOUBLE')) * 100)",
   			"ordering": null
   		}
   	],
   	"having": {
   		"type": "filter",
   		"filter": {
   			"type": "and",
   			"fields": [{
   					"type": "bound",
   					"dimension": "a1",
   					"lower": "100",
   					"upper": null,
   					"lowerStrict": false,
   					"upperStrict": false,
   					"extractionFn": null,
   					"ordering": {
   						"type": "numeric"
   					}
   				},
   				{
   					"type": "expression",
   					"expression": "(((CAST(\"a1\", 'DOUBLE') / CAST(\"a0\", 'DOUBLE')) * 100) <= 4.90)"
   				}
   			]
   		},
   		"finalize": true
   	},
   	"limitSpec": {
   		"type": "default",
   		"columns": [{
   			"dimension": "p2",
   			"direction": "ascending",
   			"dimensionOrder": {
   				"type": "numeric"
   			}
   		}],
   		"limit": 100
   	},
   	"context": {
   		"sqlOuterLimit": 100,
   		"sqlQueryId": "39b9554f-e4d6-4b80-b106-71bba7f17d05"
   	},
   	"descending": false
   }`
   
   **So, when I use below query expect the count to be 6. But it seems not apply HAVING clause here when fetching count.**
   
   `SELECT COUNT(*) AS totalCounts FROM (SELECT advertiser_id, campaign_id, city_code, matched_keywords, DATE_TRUNC('year', __time) AS Year_, 
   SUM(impression) AS impression, SUM(click) AS click, (CAST(SUM(click) as FLOAT)/CAST(SUM(impression) as FLOAT))*100 AS CTR_, 
   COUNT(*) AS counts FROM ad_fact_v1 WHERE __time >= '2018-09-01T18:05:39Z' AND __time < '2021-08-16T18:05:39Z' AND 
   advertiser_id IN ('290641618220892166') AND campaign_id IN ('347205814932489837') GROUP BY 1, 2, 3, 4, 5 HAVING click >= 100 AND CTR_ <= 4.90 ORDER BY CTR_ ASC) source`
   
   Equivalent native query using explain:
   
   `DruidOuterQueryRel(query = [{
   	"queryType": "timeseries",
   	"dataSource": {
   		"type": "table",
   		"name": "__subquery__"
   	},
   	"intervals": {
   		"type": "intervals",
   		"intervals": ["-146136543-09-08T08:23:32.096Z/146140482-04-24T15:36:27.903Z"]
   	},
   	"descending": false,
   	"virtualColumns": [{
   		"type": "expression",
   		"name": "v0",
   		"expression": "((CAST(\"click\", 'DOUBLE') / CAST(\"impression\", 'DOUBLE')) * 100)",
   		"outputType": "FLOAT"
   	}],
   	"filter": {
   		"type": "and",
   		"fields": [{
   			"type": "bound",
   			"dimension": "click",
   			"lower": "100",
   			"upper": null,
   			"lowerStrict": false,
   			"upperStrict": false,
   			"extractionFn": null,
   			"ordering": {
   				"type": "numeric"
   			}
   		}, {
   			"type": "bound",
   			"dimension": "v0",
   			"lower": null,
   			"upper": "4.90",
   			"lowerStrict": false,
   			"upperStrict": false,
   			"extractionFn": null,
   			"ordering": {
   				"type": "numeric"
   			}
   		}]
   	},
   	"granularity": {
   		"type": "all"
   	},
   	"aggregations": [{
   		"type": "count",
   		"name": "a0"
   	}],
   	"postAggregations": [],
   	"limit": 2147483647,
   	"context": {
   		"skipEmptyBuckets": true,
   		"sqlOuterLimit": 100,
   		"sqlQueryId": "c35534f5-38dc-4747-8812-36d2c063dd74"
   	}
   }], signature = [{
   	a0: LONG
   }]) DruidQueryRel(query = [{
   	"queryType": "groupBy",
   	"dataSource": {
   		"type": "table",
   		"name": "ad_fact_v1"
   	},
   	"intervals": {
   		"type": "intervals",
   		"intervals": ["2018-09-01T18:05:39.000Z/2021-08-16T18:05:39.000Z"]
   	},
   	"virtualColumns": [{
   		"type": "expression",
   		"name": "v0",
   		"expression": "'290641618220892166'",
   		"outputType": "STRING"
   	}, {
   		"type": "expression",
   		"name": "v1",
   		"expression": "'347205814932489837'",
   		"outputType": "STRING"
   	}, {
   		"type": "expression",
   		"name": "v2",
   		"expression": "timestamp_floor(\"__time\",'P1Y',null,'UTC')",
   		"outputType": "LONG"
   	}],
   	"filter": {
   		"type": "and",
   		"fields": [{
   			"type": "selector",
   			"dimension": "advertiser_id",
   			"value": "290641618220892166",
   			"extractionFn": null
   		}, {
   			"type": "selector",
   			"dimension": "campaign_id",
   			"value": "347205814932489837",
   			"extractionFn": null
   		}]
   	},
   	"granularity": {
   		"type": "all"
   	},
   	"dimensions": [{
   		"type": "default",
   		"dimension": "v0",
   		"outputName": "d0",
   		"outputType": "STRING"
   	}, {
   		"type": "default",
   		"dimension": "v1",
   		"outputName": "d1",
   		"outputType": "STRING"
   	}, {
   		"type": "default",
   		"dimension": "city_code",
   		"outputName": "d2",
   		"outputType": "STRING"
   	}, {
   		"type": "default",
   		"dimension": "matched_keywords",
   		"outputName": "d3",
   		"outputType": "STRING"
   	}, {
   		"type": "default",
   		"dimension": "v2",
   		"outputName": "d4",
   		"outputType": "LONG"
   	}],
   	"aggregations": [{
   		"type": "longSum",
   		"name": "a0",
   		"fieldName": "impression",
   		"expression": null
   	}, {
   		"type": "longSum",
   		"name": "a1",
   		"fieldName": "click",
   		"expression": null
   	}],
   	"postAggregations": [],
   	"having": null,
   	"limitSpec": {
   		"type": "NoopLimitSpec"
   	},
   	"context": {
   		"sqlOuterLimit": 100,
   		"sqlQueryId": "c35534f5-38dc-4747-8812-36d2c063dd74"
   	},
   	"descending": false
   }], signature = [{
   	d0: STRING,
   	d1: STRING,
   	d2: STRING,
   	d3: STRING,
   	d4: LONG,
   	a0: LONG,
   	a1: LONG
   }])`
   
   
   
   
   
   Debug Done:
   
   - Looks like derived aggregation like **(CAST(SUM(click) as FLOAT)/CAST(SUM(impression) as FLOAT))*100 AS CTR_** is not considered while doing the count(*).
   - count(*) is using _"queryType": "timeseries"_ where as without count(*) uses _"queryType": "groupBy"_. Is that possible reason?
   
   Any help would be highly appreciated, thank you in advance.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] JRobTS commented on issue #11603: Druid SQL: Query with Count(*) is not returning actual number of rows with HAVING clause on derived columns

Posted by GitBox <gi...@apache.org>.
JRobTS commented on issue #11603:
URL: https://github.com/apache/druid/issues/11603#issuecomment-961180338


   I also see inconsistent results with CAST but without the subquery. This happens on the most recently ingested data (e.g. data ingested in the past 10 minutes)
   
   ```
   SELECT COALESCE(ReturnDate, 0)
           FROM "datasource"
           WHERE __time >= '2021-10-27T14:00:00+00:00'
           AND __time < '2021-10-27T15:00:00+00:00' 
           AND EventName = 'Completed' 
   ```
   --> 18640
   
   ```
   SELECT CAST(ReturnDate AS INTEGER)
           FROM "datasource"
           WHERE __time >= '2021-10-27T14:00:00+00:00'
           AND __time < '2021-10-27T15:00:00+00:00' 
           AND EventName = 'Completed' 
   ```
   --> 13375
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] JRobTS edited a comment on issue #11603: Druid SQL: Query with Count(*) is not returning actual number of rows with HAVING clause on derived columns

Posted by GitBox <gi...@apache.org>.
JRobTS edited a comment on issue #11603:
URL: https://github.com/apache/druid/issues/11603#issuecomment-961180338


   I also see inconsistent results with CAST but without the subquery. This happens on the most recently ingested data (e.g. data ingested in the past 10 minutes)
   
   ```
   SELECT COUNT(*)
           FROM "datasource"
           WHERE __time >= '2021-10-27T14:00:00+00:00'
           AND __time < '2021-10-27T15:00:00+00:00' 
           AND EventName = 'Completed' 
   ```
   --> 18640
   
   ```
   SELECT COALESCE(ReturnDate, 0)
           FROM "datasource"
           WHERE __time >= '2021-10-27T14:00:00+00:00'
           AND __time < '2021-10-27T15:00:00+00:00' 
           AND EventName = 'Completed' 
   ```
   --> 18640
   
   ```
   SELECT CAST(ReturnDate AS INTEGER)
           FROM "datasource"
           WHERE __time >= '2021-10-27T14:00:00+00:00'
           AND __time < '2021-10-27T15:00:00+00:00' 
           AND EventName = 'Completed' 
   ```
   --> 13375
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org