You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/05/04 17:25:49 UTC

[GitHub] [druid] benkrug opened a new issue #11197: incorrect druid SQL results using GROUP BY and COUNT

benkrug opened a new issue #11197:
URL: https://github.com/apache/druid/issues/11197


   Some druid SQL queries including GROUP BY and COUNT return incorrect results.  I'm including two examples.
   
   ### Affected Version
   
   Seen in 0.20.0 and 0.21.0.  The second example below apparently doesn't reproduce in 0.18.0.  I haven't tested the first on 0.18.0.
   
   ### Description
   
   Particular GROUP BY queries are returning 0 counts when they should return non-zero numbers.  Providing two examples, one with wikipedia, and one with a custom data set (attached).
   
   With wikipedia, we can see the following, to zero in on the issue:
   
   This query returns various non-zero numbers for the counts:
   
   ```
   select comment, count(*) FROM wikipedia
   WHERE channel = '#ar.wikipedia'
   GROUP BY comment
   ```
   
   This query will return 1's, which makes sense:
   
   ```
   select comment, count(distinct(channel)) FROM wikipedia
   WHERE channel = '#ar.wikipedia'
   GROUP BY comment
   ```
   
   So, we have data for the channel field (which makes sense, since we filter for a channel), but the following query returns 0's for all the counts, which is not only incorrect, but logically impossible given the query:
   
   ```
   select comment, count(channel) FROM wikipedia
   WHERE channel = '#ar.wikipedia'
   GROUP BY comment
   ```
   
   The second example adds a twist, I'm not sure of the reason.  With the given dataset (data file and spec attached), the following query also incorrectly returns all 0's:
   
   ```
   SELECT ctnsappname, count(ind_id) FROM testData 
   WHERE ind_id = 3006
   AND __time >= '2000-01-01'
   AND __time < '3000-01-01'
   GROUP BY ctnsappname
   ```
   
   Note that all the __time values are from 2021-01-31, so all data is included in the time filters.  However (the twist), if you remove the time filters, you seem to get correct results.  I didn't test with 0.18.0, but Jie Zang reported that for this query, results were correct in 0.18.0.  He also noted that in 0.18.0, the explain uses a virtual column, but in 0.21.0, the virtual column entry (for v0) is empty.  I saw the same in 0.20.0.
   
   
   
   [ingestdata.txt](https://github.com/apache/druid/files/6422782/ingestdata.txt)
   [ingestspec.txt](https://github.com/apache/druid/files/6422783/ingestspec.txt)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org