You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@drill.apache.org by Stefán Baxter <st...@activitystream.com> on 2015/07/30 16:14:32 UTC

Count where or having clause does not work as expected

Hi,

I have data that can be reduced to this:

   - {"client_ip":{"country_code":"US"}}
   - {"client_ip":{"country_code":"US"}}
   - {"client_ip":{"country_code":"US"}}
   - {"client_ip":{"country_code":"GB"}}
   - {"client_ip":{"country_code":"US"}}

This works fine:

select p.client_ip.country_code country_code, count(*) event_count from
dfs.tmp.`/events` as p group by p.client_ip.country_code;


+---------+--------------+
| country_code  | event_count  |
+---------+--------------+
| US      | 21516        |
| GB      | 323594       |
+---------+--------------+


This simple where clause returns nothing:

select p.client_ip.country_code country_code, count(*) event_count from
dfs.tmp.`/events` as p where event_count > 30000 group by
p.client_ip.country_code;
+---------------+--------------+
| country_code  | event_count  |
+---------------+--------------+
+---------------+--------------+


Using a having clause (v1):

select p.client_ip.country_code country_code, count(*) event_count from
dfs.tmp.`/events` as p group by p.client_ip.country_code having event_count
> 30000;

Error: PARSE ERROR: From line 1, column 169 to line 1, column 179:
Expression 'event_count' is not being grouped


Using a having clause (v2):

select p.client_ip.country_code country_code, count(*) event_count from
dfs.tmp.`/events` as p group by p.client_ip.country_code having count(*) >
30000;
+---------------+--------------+
| country_code  | event_count  |
+---------------+--------------+
| GB            | 323594       |
| null*          | 566667       |
+---------------+--------------+


* there are no null values in the database and this looks to be the "total
for the rest"


Does anyone know how this can be made to work?

Regards,
 -Stefan

Re: Count where or having clause does not work as expected

Posted by Jinfeng Ni <ji...@gmail.com>.
Good to know now it works.

Regarding the support of alias in having clause, we had some discussion
before. In a schema-aware system, user normally would see the query using
the alias fails at validation, since the alias column does not exist in the
table.  This will prevent them from getting "surprising" result, just as
what you got.

In a schema-less system like Drill, it is more likely to encounter this
kind of "surprising" problem, since Drill will simply assume the
"event_count" reference is a regular column from the table.  We probably
should enhance the SQL planner to add the support of alias in having
clause.


On Thu, Jul 30, 2015 at 9:16 AM, Stefán Baxter <st...@activitystream.com>
wrote:

> thank you, I got it :).
>
> Completely understanding the Having clause "limitation" and just missing
> the column aliasing a bit.
> When I looked at the null results I was so sure that there were no nulls in
> the data-set that I did not bother to double check and just wrote the
> email.
>
> Thanks for the assistance,
>  -Stefan
>
>
> On Thu, Jul 30, 2015 at 4:11 PM, Jinfeng Ni <ji...@gmail.com> wrote:
>
> > This is related to using alias in having clause [1].  Drill does not
> allow
> > using alias in having clause.
> >
> >
> > Q1.
> > select p.client_ip.country_code country_code, count(*) event_count from
> > dfs.tmp.`/events` as p where event_count > 30000 group by
> > p.client_ip.country_code;
> >
> > In Q1, the event-count in the where clause is not refer to the alias
> > "event_count". In stead, it is resolved to refer to a column from the
> table
> > "events".  In the table, "event_count" does not exists, and is treated as
> > null. That's why you get 0 rows.
> >
> > Q2.
> > select p.client_ip.country_code country_code, count(*) event_count from
> > dfs.tmp.`/events` as p group by p.client_ip.country_code having
> event_count
> > > 30000;
> >
> > Again, "event_count" in having clause is not resolved to alias
> > "event_count" in select list. That's why SQL validation fail.
> >
> >
> > Q3
> > select p.client_ip.country_code country_code, count(*) event_count from
> > dfs.tmp.`/events` as p group by p.client_ip.country_code having count(*)
> >
> > 30000;
> >
> > This is the right approach.
> >
> > Regarding the null value and the total count, I tried with a slightly
> > modified input ( adding two rows with country_code as null).
> >
> > {"client_ip":{"country_code":"US"}}
> > {"client_ip":{"country_code":"US"}}
> > {"client_ip":{"country_code":"US"}}
> > {"client_ip":{"country_code":"GB"}}
> > {"client_ip":{"country_code":"US"}}
> > {"client_ip":{}}
> > {"client_ip":{}}
> >
> >
> > select p.client_ip.country_code country_code, count(*) event_count
> > from dfs.tmp.`events.json` as p
> > group by p.client_ip.country_code
> > having count(*) > 1;
> >
> > +---------------+--------------+
> > | country_code  | event_count  |
> > +---------------+--------------+
> > | US            | 4            |
> > | null          | 2            |
> > +---------------+--------------+
> > 2 rows selected (0.285 seconds)
> >
> > Looks like the query gets the correct answer.
> >
> >
> > [1]
> >
> >
> http://stackoverflow.com/questions/2068682/why-cant-i-use-alias-in-a-count-column-and-reference-it-in-a-having-clause
> >
> >
> > On Thu, Jul 30, 2015 at 8:40 AM, Stefán Baxter <
> stefan@activitystream.com>
> > wrote:
> >
> > > Hi,
> > >
> > > event_count is a column alias for count(*) (without the as).
> > >
> > > Regards,
> > >  -Stefan
> > >
> > > On Thu, Jul 30, 2015 at 3:32 PM, Jacques Nadeau <ja...@dremio.com>
> > > wrote:
> > >
> > > > select p.client_ip.country_code country_code, count(*) event_count
> from
> > > > dfs.tmp.`/events` as p where event_count > 30000 group by
> > > > p.client_ip.country_code;
> > > > +---------------+--------------+
> > > > | country_code  | event_count  |
> > > > +---------------+--------------+
> > > > +---------------+--------------+
> > > >
> > > > In this situation, a fully-schemaed system would have stated that the
> > > > event_count column does not exist.  In the case of json, we don't
> know
> > if
> > > > you're unlying data has an event_count column.  When we actually try
> to
> > > get
> > > > it, we find it doesn't exist and is therefore null.  Null is never
> > > greater
> > > > than 30000 so we exclude all records.  Modify your data to look like
> > > this:
> > > >
> > > > {event_count: 30001, "client_ip":{"country_code":"US"}}
> > > > {event_count: 30001, "client_ip":{"country_code":"US"}}
> > > > {event_count: 30001, "client_ip":{"country_code":"US"}}
> > > > {event_count: 30001, "client_ip":{"country_code":"GB"}}
> > > > {event_count: 30001, "client_ip":{"country_code":"US"}}
> > > >
> > > > And you'll see what your query is actually doing.  In SQL, an
> aggregate
> > > > function is applied after the WHERE clause.  The only way to apply a
> > > filter
> > > > after an aggregate is with the HAVING clause.
> > > >
> > > >
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > > > On Thu, Jul 30, 2015 at 8:20 AM, Andries Engelbrecht <
> > > > aengelbrecht@maprtech.com> wrote:
> > > >
> > > > > Last I checked group by and having clauses do not support column
> > > aliases.
> > > > >
> > > > > —Andries
> > > > >
> > > > >
> > > > > > On Jul 30, 2015, at 7:19 AM, Stefán Baxter <
> > > stefan@activitystream.com>
> > > > > wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > That last case works as expected, sorry, this test data does have
> > > null
> > > > > > values for country_code.
> > > > > >
> > > > > > That means that I have a working solution but that it would be
> nice
> > > if
> > > > v1
> > > > > > (above) would work.
> > > > > >
> > > > > > Thank you,
> > > > > > -Stefán
> > > > > >
> > > > > > On Thu, Jul 30, 2015 at 2:15 PM, Stefán Baxter <
> > > > > stefan@activitystream.com>
> > > > > > wrote:
> > > > > >
> > > > > >> make that "Count in where or having clause does not work as
> > > expected"
> > > > > >>
> > > > > >> On Thu, Jul 30, 2015 at 2:14 PM, Stefán Baxter <
> > > > > stefan@activitystream.com>
> > > > > >> wrote:
> > > > > >>
> > > > > >>> Hi,
> > > > > >>>
> > > > > >>> I have data that can be reduced to this:
> > > > > >>>
> > > > > >>>   - {"client_ip":{"country_code":"US"}}
> > > > > >>>   - {"client_ip":{"country_code":"US"}}
> > > > > >>>   - {"client_ip":{"country_code":"US"}}
> > > > > >>>   - {"client_ip":{"country_code":"GB"}}
> > > > > >>>   - {"client_ip":{"country_code":"US"}}
> > > > > >>>
> > > > > >>> This works fine:
> > > > > >>>
> > > > > >>> select p.client_ip.country_code country_code, count(*)
> > event_count
> > > > from
> > > > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code;
> > > > > >>>
> > > > > >>>
> > > > > >>> +---------+--------------+
> > > > > >>> | country_code  | event_count  |
> > > > > >>> +---------+--------------+
> > > > > >>> | US      | 21516        |
> > > > > >>> | GB      | 323594       |
> > > > > >>> +---------+--------------+
> > > > > >>>
> > > > > >>>
> > > > > >>> This simple where clause returns nothing:
> > > > > >>>
> > > > > >>> select p.client_ip.country_code country_code, count(*)
> > event_count
> > > > from
> > > > > >>> dfs.tmp.`/events` as p where event_count > 30000 group by
> > > > > >>> p.client_ip.country_code;
> > > > > >>> +---------------+--------------+
> > > > > >>> | country_code  | event_count  |
> > > > > >>> +---------------+--------------+
> > > > > >>> +---------------+--------------+
> > > > > >>>
> > > > > >>>
> > > > > >>> Using a having clause (v1):
> > > > > >>>
> > > > > >>> select p.client_ip.country_code country_code, count(*)
> > event_count
> > > > from
> > > > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code having
> > > > > event_count
> > > > > >>>> 30000;
> > > > > >>>
> > > > > >>> Error: PARSE ERROR: From line 1, column 169 to line 1, column
> > 179:
> > > > > >>> Expression 'event_count' is not being grouped
> > > > > >>>
> > > > > >>>
> > > > > >>> Using a having clause (v2):
> > > > > >>>
> > > > > >>> select p.client_ip.country_code country_code, count(*)
> > event_count
> > > > from
> > > > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code having
> > > > > count(*) >
> > > > > >>> 30000;
> > > > > >>> +---------------+--------------+
> > > > > >>> | country_code  | event_count  |
> > > > > >>> +---------------+--------------+
> > > > > >>> | GB            | 323594       |
> > > > > >>> | null*          | 566667       |
> > > > > >>> +---------------+--------------+
> > > > > >>>
> > > > > >>>
> > > > > >>> * there are no null values in the database and this looks to be
> > the
> > > > > >>> "total for the rest"
> > > > > >>>
> > > > > >>>
> > > > > >>> Does anyone know how this can be made to work?
> > > > > >>>
> > > > > >>> Regards,
> > > > > >>> -Stefan
> > > > > >>>
> > > > > >>
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: Count where or having clause does not work as expected

Posted by Stefán Baxter <st...@activitystream.com>.
thank you, I got it :).

Completely understanding the Having clause "limitation" and just missing
the column aliasing a bit.
When I looked at the null results I was so sure that there were no nulls in
the data-set that I did not bother to double check and just wrote the email.

Thanks for the assistance,
 -Stefan


On Thu, Jul 30, 2015 at 4:11 PM, Jinfeng Ni <ji...@gmail.com> wrote:

> This is related to using alias in having clause [1].  Drill does not allow
> using alias in having clause.
>
>
> Q1.
> select p.client_ip.country_code country_code, count(*) event_count from
> dfs.tmp.`/events` as p where event_count > 30000 group by
> p.client_ip.country_code;
>
> In Q1, the event-count in the where clause is not refer to the alias
> "event_count". In stead, it is resolved to refer to a column from the table
> "events".  In the table, "event_count" does not exists, and is treated as
> null. That's why you get 0 rows.
>
> Q2.
> select p.client_ip.country_code country_code, count(*) event_count from
> dfs.tmp.`/events` as p group by p.client_ip.country_code having event_count
> > 30000;
>
> Again, "event_count" in having clause is not resolved to alias
> "event_count" in select list. That's why SQL validation fail.
>
>
> Q3
> select p.client_ip.country_code country_code, count(*) event_count from
> dfs.tmp.`/events` as p group by p.client_ip.country_code having count(*) >
> 30000;
>
> This is the right approach.
>
> Regarding the null value and the total count, I tried with a slightly
> modified input ( adding two rows with country_code as null).
>
> {"client_ip":{"country_code":"US"}}
> {"client_ip":{"country_code":"US"}}
> {"client_ip":{"country_code":"US"}}
> {"client_ip":{"country_code":"GB"}}
> {"client_ip":{"country_code":"US"}}
> {"client_ip":{}}
> {"client_ip":{}}
>
>
> select p.client_ip.country_code country_code, count(*) event_count
> from dfs.tmp.`events.json` as p
> group by p.client_ip.country_code
> having count(*) > 1;
>
> +---------------+--------------+
> | country_code  | event_count  |
> +---------------+--------------+
> | US            | 4            |
> | null          | 2            |
> +---------------+--------------+
> 2 rows selected (0.285 seconds)
>
> Looks like the query gets the correct answer.
>
>
> [1]
>
> http://stackoverflow.com/questions/2068682/why-cant-i-use-alias-in-a-count-column-and-reference-it-in-a-having-clause
>
>
> On Thu, Jul 30, 2015 at 8:40 AM, Stefán Baxter <st...@activitystream.com>
> wrote:
>
> > Hi,
> >
> > event_count is a column alias for count(*) (without the as).
> >
> > Regards,
> >  -Stefan
> >
> > On Thu, Jul 30, 2015 at 3:32 PM, Jacques Nadeau <ja...@dremio.com>
> > wrote:
> >
> > > select p.client_ip.country_code country_code, count(*) event_count from
> > > dfs.tmp.`/events` as p where event_count > 30000 group by
> > > p.client_ip.country_code;
> > > +---------------+--------------+
> > > | country_code  | event_count  |
> > > +---------------+--------------+
> > > +---------------+--------------+
> > >
> > > In this situation, a fully-schemaed system would have stated that the
> > > event_count column does not exist.  In the case of json, we don't know
> if
> > > you're unlying data has an event_count column.  When we actually try to
> > get
> > > it, we find it doesn't exist and is therefore null.  Null is never
> > greater
> > > than 30000 so we exclude all records.  Modify your data to look like
> > this:
> > >
> > > {event_count: 30001, "client_ip":{"country_code":"US"}}
> > > {event_count: 30001, "client_ip":{"country_code":"US"}}
> > > {event_count: 30001, "client_ip":{"country_code":"US"}}
> > > {event_count: 30001, "client_ip":{"country_code":"GB"}}
> > > {event_count: 30001, "client_ip":{"country_code":"US"}}
> > >
> > > And you'll see what your query is actually doing.  In SQL, an aggregate
> > > function is applied after the WHERE clause.  The only way to apply a
> > filter
> > > after an aggregate is with the HAVING clause.
> > >
> > >
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio
> > >
> > > On Thu, Jul 30, 2015 at 8:20 AM, Andries Engelbrecht <
> > > aengelbrecht@maprtech.com> wrote:
> > >
> > > > Last I checked group by and having clauses do not support column
> > aliases.
> > > >
> > > > —Andries
> > > >
> > > >
> > > > > On Jul 30, 2015, at 7:19 AM, Stefán Baxter <
> > stefan@activitystream.com>
> > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > That last case works as expected, sorry, this test data does have
> > null
> > > > > values for country_code.
> > > > >
> > > > > That means that I have a working solution but that it would be nice
> > if
> > > v1
> > > > > (above) would work.
> > > > >
> > > > > Thank you,
> > > > > -Stefán
> > > > >
> > > > > On Thu, Jul 30, 2015 at 2:15 PM, Stefán Baxter <
> > > > stefan@activitystream.com>
> > > > > wrote:
> > > > >
> > > > >> make that "Count in where or having clause does not work as
> > expected"
> > > > >>
> > > > >> On Thu, Jul 30, 2015 at 2:14 PM, Stefán Baxter <
> > > > stefan@activitystream.com>
> > > > >> wrote:
> > > > >>
> > > > >>> Hi,
> > > > >>>
> > > > >>> I have data that can be reduced to this:
> > > > >>>
> > > > >>>   - {"client_ip":{"country_code":"US"}}
> > > > >>>   - {"client_ip":{"country_code":"US"}}
> > > > >>>   - {"client_ip":{"country_code":"US"}}
> > > > >>>   - {"client_ip":{"country_code":"GB"}}
> > > > >>>   - {"client_ip":{"country_code":"US"}}
> > > > >>>
> > > > >>> This works fine:
> > > > >>>
> > > > >>> select p.client_ip.country_code country_code, count(*)
> event_count
> > > from
> > > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code;
> > > > >>>
> > > > >>>
> > > > >>> +---------+--------------+
> > > > >>> | country_code  | event_count  |
> > > > >>> +---------+--------------+
> > > > >>> | US      | 21516        |
> > > > >>> | GB      | 323594       |
> > > > >>> +---------+--------------+
> > > > >>>
> > > > >>>
> > > > >>> This simple where clause returns nothing:
> > > > >>>
> > > > >>> select p.client_ip.country_code country_code, count(*)
> event_count
> > > from
> > > > >>> dfs.tmp.`/events` as p where event_count > 30000 group by
> > > > >>> p.client_ip.country_code;
> > > > >>> +---------------+--------------+
> > > > >>> | country_code  | event_count  |
> > > > >>> +---------------+--------------+
> > > > >>> +---------------+--------------+
> > > > >>>
> > > > >>>
> > > > >>> Using a having clause (v1):
> > > > >>>
> > > > >>> select p.client_ip.country_code country_code, count(*)
> event_count
> > > from
> > > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code having
> > > > event_count
> > > > >>>> 30000;
> > > > >>>
> > > > >>> Error: PARSE ERROR: From line 1, column 169 to line 1, column
> 179:
> > > > >>> Expression 'event_count' is not being grouped
> > > > >>>
> > > > >>>
> > > > >>> Using a having clause (v2):
> > > > >>>
> > > > >>> select p.client_ip.country_code country_code, count(*)
> event_count
> > > from
> > > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code having
> > > > count(*) >
> > > > >>> 30000;
> > > > >>> +---------------+--------------+
> > > > >>> | country_code  | event_count  |
> > > > >>> +---------------+--------------+
> > > > >>> | GB            | 323594       |
> > > > >>> | null*          | 566667       |
> > > > >>> +---------------+--------------+
> > > > >>>
> > > > >>>
> > > > >>> * there are no null values in the database and this looks to be
> the
> > > > >>> "total for the rest"
> > > > >>>
> > > > >>>
> > > > >>> Does anyone know how this can be made to work?
> > > > >>>
> > > > >>> Regards,
> > > > >>> -Stefan
> > > > >>>
> > > > >>
> > > > >>
> > > >
> > > >
> > >
> >
>

Re: Count where or having clause does not work as expected

Posted by Jinfeng Ni <ji...@gmail.com>.
This is related to using alias in having clause [1].  Drill does not allow
using alias in having clause.


Q1.
select p.client_ip.country_code country_code, count(*) event_count from
dfs.tmp.`/events` as p where event_count > 30000 group by
p.client_ip.country_code;

In Q1, the event-count in the where clause is not refer to the alias
"event_count". In stead, it is resolved to refer to a column from the table
"events".  In the table, "event_count" does not exists, and is treated as
null. That's why you get 0 rows.

Q2.
select p.client_ip.country_code country_code, count(*) event_count from
dfs.tmp.`/events` as p group by p.client_ip.country_code having event_count
> 30000;

Again, "event_count" in having clause is not resolved to alias
"event_count" in select list. That's why SQL validation fail.


Q3
select p.client_ip.country_code country_code, count(*) event_count from
dfs.tmp.`/events` as p group by p.client_ip.country_code having count(*) >
30000;

This is the right approach.

Regarding the null value and the total count, I tried with a slightly
modified input ( adding two rows with country_code as null).

{"client_ip":{"country_code":"US"}}
{"client_ip":{"country_code":"US"}}
{"client_ip":{"country_code":"US"}}
{"client_ip":{"country_code":"GB"}}
{"client_ip":{"country_code":"US"}}
{"client_ip":{}}
{"client_ip":{}}


select p.client_ip.country_code country_code, count(*) event_count
from dfs.tmp.`events.json` as p
group by p.client_ip.country_code
having count(*) > 1;

+---------------+--------------+
| country_code  | event_count  |
+---------------+--------------+
| US            | 4            |
| null          | 2            |
+---------------+--------------+
2 rows selected (0.285 seconds)

Looks like the query gets the correct answer.


[1]
http://stackoverflow.com/questions/2068682/why-cant-i-use-alias-in-a-count-column-and-reference-it-in-a-having-clause


On Thu, Jul 30, 2015 at 8:40 AM, Stefán Baxter <st...@activitystream.com>
wrote:

> Hi,
>
> event_count is a column alias for count(*) (without the as).
>
> Regards,
>  -Stefan
>
> On Thu, Jul 30, 2015 at 3:32 PM, Jacques Nadeau <ja...@dremio.com>
> wrote:
>
> > select p.client_ip.country_code country_code, count(*) event_count from
> > dfs.tmp.`/events` as p where event_count > 30000 group by
> > p.client_ip.country_code;
> > +---------------+--------------+
> > | country_code  | event_count  |
> > +---------------+--------------+
> > +---------------+--------------+
> >
> > In this situation, a fully-schemaed system would have stated that the
> > event_count column does not exist.  In the case of json, we don't know if
> > you're unlying data has an event_count column.  When we actually try to
> get
> > it, we find it doesn't exist and is therefore null.  Null is never
> greater
> > than 30000 so we exclude all records.  Modify your data to look like
> this:
> >
> > {event_count: 30001, "client_ip":{"country_code":"US"}}
> > {event_count: 30001, "client_ip":{"country_code":"US"}}
> > {event_count: 30001, "client_ip":{"country_code":"US"}}
> > {event_count: 30001, "client_ip":{"country_code":"GB"}}
> > {event_count: 30001, "client_ip":{"country_code":"US"}}
> >
> > And you'll see what your query is actually doing.  In SQL, an aggregate
> > function is applied after the WHERE clause.  The only way to apply a
> filter
> > after an aggregate is with the HAVING clause.
> >
> >
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Thu, Jul 30, 2015 at 8:20 AM, Andries Engelbrecht <
> > aengelbrecht@maprtech.com> wrote:
> >
> > > Last I checked group by and having clauses do not support column
> aliases.
> > >
> > > —Andries
> > >
> > >
> > > > On Jul 30, 2015, at 7:19 AM, Stefán Baxter <
> stefan@activitystream.com>
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > That last case works as expected, sorry, this test data does have
> null
> > > > values for country_code.
> > > >
> > > > That means that I have a working solution but that it would be nice
> if
> > v1
> > > > (above) would work.
> > > >
> > > > Thank you,
> > > > -Stefán
> > > >
> > > > On Thu, Jul 30, 2015 at 2:15 PM, Stefán Baxter <
> > > stefan@activitystream.com>
> > > > wrote:
> > > >
> > > >> make that "Count in where or having clause does not work as
> expected"
> > > >>
> > > >> On Thu, Jul 30, 2015 at 2:14 PM, Stefán Baxter <
> > > stefan@activitystream.com>
> > > >> wrote:
> > > >>
> > > >>> Hi,
> > > >>>
> > > >>> I have data that can be reduced to this:
> > > >>>
> > > >>>   - {"client_ip":{"country_code":"US"}}
> > > >>>   - {"client_ip":{"country_code":"US"}}
> > > >>>   - {"client_ip":{"country_code":"US"}}
> > > >>>   - {"client_ip":{"country_code":"GB"}}
> > > >>>   - {"client_ip":{"country_code":"US"}}
> > > >>>
> > > >>> This works fine:
> > > >>>
> > > >>> select p.client_ip.country_code country_code, count(*) event_count
> > from
> > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code;
> > > >>>
> > > >>>
> > > >>> +---------+--------------+
> > > >>> | country_code  | event_count  |
> > > >>> +---------+--------------+
> > > >>> | US      | 21516        |
> > > >>> | GB      | 323594       |
> > > >>> +---------+--------------+
> > > >>>
> > > >>>
> > > >>> This simple where clause returns nothing:
> > > >>>
> > > >>> select p.client_ip.country_code country_code, count(*) event_count
> > from
> > > >>> dfs.tmp.`/events` as p where event_count > 30000 group by
> > > >>> p.client_ip.country_code;
> > > >>> +---------------+--------------+
> > > >>> | country_code  | event_count  |
> > > >>> +---------------+--------------+
> > > >>> +---------------+--------------+
> > > >>>
> > > >>>
> > > >>> Using a having clause (v1):
> > > >>>
> > > >>> select p.client_ip.country_code country_code, count(*) event_count
> > from
> > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code having
> > > event_count
> > > >>>> 30000;
> > > >>>
> > > >>> Error: PARSE ERROR: From line 1, column 169 to line 1, column 179:
> > > >>> Expression 'event_count' is not being grouped
> > > >>>
> > > >>>
> > > >>> Using a having clause (v2):
> > > >>>
> > > >>> select p.client_ip.country_code country_code, count(*) event_count
> > from
> > > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code having
> > > count(*) >
> > > >>> 30000;
> > > >>> +---------------+--------------+
> > > >>> | country_code  | event_count  |
> > > >>> +---------------+--------------+
> > > >>> | GB            | 323594       |
> > > >>> | null*          | 566667       |
> > > >>> +---------------+--------------+
> > > >>>
> > > >>>
> > > >>> * there are no null values in the database and this looks to be the
> > > >>> "total for the rest"
> > > >>>
> > > >>>
> > > >>> Does anyone know how this can be made to work?
> > > >>>
> > > >>> Regards,
> > > >>> -Stefan
> > > >>>
> > > >>
> > > >>
> > >
> > >
> >
>

Re: Count where or having clause does not work as expected

Posted by Stefán Baxter <st...@activitystream.com>.
Hi,

event_count is a column alias for count(*) (without the as).

Regards,
 -Stefan

On Thu, Jul 30, 2015 at 3:32 PM, Jacques Nadeau <ja...@dremio.com> wrote:

> select p.client_ip.country_code country_code, count(*) event_count from
> dfs.tmp.`/events` as p where event_count > 30000 group by
> p.client_ip.country_code;
> +---------------+--------------+
> | country_code  | event_count  |
> +---------------+--------------+
> +---------------+--------------+
>
> In this situation, a fully-schemaed system would have stated that the
> event_count column does not exist.  In the case of json, we don't know if
> you're unlying data has an event_count column.  When we actually try to get
> it, we find it doesn't exist and is therefore null.  Null is never greater
> than 30000 so we exclude all records.  Modify your data to look like this:
>
> {event_count: 30001, "client_ip":{"country_code":"US"}}
> {event_count: 30001, "client_ip":{"country_code":"US"}}
> {event_count: 30001, "client_ip":{"country_code":"US"}}
> {event_count: 30001, "client_ip":{"country_code":"GB"}}
> {event_count: 30001, "client_ip":{"country_code":"US"}}
>
> And you'll see what your query is actually doing.  In SQL, an aggregate
> function is applied after the WHERE clause.  The only way to apply a filter
> after an aggregate is with the HAVING clause.
>
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Thu, Jul 30, 2015 at 8:20 AM, Andries Engelbrecht <
> aengelbrecht@maprtech.com> wrote:
>
> > Last I checked group by and having clauses do not support column aliases.
> >
> > —Andries
> >
> >
> > > On Jul 30, 2015, at 7:19 AM, Stefán Baxter <st...@activitystream.com>
> > wrote:
> > >
> > > Hi,
> > >
> > > That last case works as expected, sorry, this test data does have null
> > > values for country_code.
> > >
> > > That means that I have a working solution but that it would be nice if
> v1
> > > (above) would work.
> > >
> > > Thank you,
> > > -Stefán
> > >
> > > On Thu, Jul 30, 2015 at 2:15 PM, Stefán Baxter <
> > stefan@activitystream.com>
> > > wrote:
> > >
> > >> make that "Count in where or having clause does not work as expected"
> > >>
> > >> On Thu, Jul 30, 2015 at 2:14 PM, Stefán Baxter <
> > stefan@activitystream.com>
> > >> wrote:
> > >>
> > >>> Hi,
> > >>>
> > >>> I have data that can be reduced to this:
> > >>>
> > >>>   - {"client_ip":{"country_code":"US"}}
> > >>>   - {"client_ip":{"country_code":"US"}}
> > >>>   - {"client_ip":{"country_code":"US"}}
> > >>>   - {"client_ip":{"country_code":"GB"}}
> > >>>   - {"client_ip":{"country_code":"US"}}
> > >>>
> > >>> This works fine:
> > >>>
> > >>> select p.client_ip.country_code country_code, count(*) event_count
> from
> > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code;
> > >>>
> > >>>
> > >>> +---------+--------------+
> > >>> | country_code  | event_count  |
> > >>> +---------+--------------+
> > >>> | US      | 21516        |
> > >>> | GB      | 323594       |
> > >>> +---------+--------------+
> > >>>
> > >>>
> > >>> This simple where clause returns nothing:
> > >>>
> > >>> select p.client_ip.country_code country_code, count(*) event_count
> from
> > >>> dfs.tmp.`/events` as p where event_count > 30000 group by
> > >>> p.client_ip.country_code;
> > >>> +---------------+--------------+
> > >>> | country_code  | event_count  |
> > >>> +---------------+--------------+
> > >>> +---------------+--------------+
> > >>>
> > >>>
> > >>> Using a having clause (v1):
> > >>>
> > >>> select p.client_ip.country_code country_code, count(*) event_count
> from
> > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code having
> > event_count
> > >>>> 30000;
> > >>>
> > >>> Error: PARSE ERROR: From line 1, column 169 to line 1, column 179:
> > >>> Expression 'event_count' is not being grouped
> > >>>
> > >>>
> > >>> Using a having clause (v2):
> > >>>
> > >>> select p.client_ip.country_code country_code, count(*) event_count
> from
> > >>> dfs.tmp.`/events` as p group by p.client_ip.country_code having
> > count(*) >
> > >>> 30000;
> > >>> +---------------+--------------+
> > >>> | country_code  | event_count  |
> > >>> +---------------+--------------+
> > >>> | GB            | 323594       |
> > >>> | null*          | 566667       |
> > >>> +---------------+--------------+
> > >>>
> > >>>
> > >>> * there are no null values in the database and this looks to be the
> > >>> "total for the rest"
> > >>>
> > >>>
> > >>> Does anyone know how this can be made to work?
> > >>>
> > >>> Regards,
> > >>> -Stefan
> > >>>
> > >>
> > >>
> >
> >
>

Re: Count where or having clause does not work as expected

Posted by Jacques Nadeau <ja...@dremio.com>.
select p.client_ip.country_code country_code, count(*) event_count from
dfs.tmp.`/events` as p where event_count > 30000 group by
p.client_ip.country_code;
+---------------+--------------+
| country_code  | event_count  |
+---------------+--------------+
+---------------+--------------+

In this situation, a fully-schemaed system would have stated that the
event_count column does not exist.  In the case of json, we don't know if
you're unlying data has an event_count column.  When we actually try to get
it, we find it doesn't exist and is therefore null.  Null is never greater
than 30000 so we exclude all records.  Modify your data to look like this:

{event_count: 30001, "client_ip":{"country_code":"US"}}
{event_count: 30001, "client_ip":{"country_code":"US"}}
{event_count: 30001, "client_ip":{"country_code":"US"}}
{event_count: 30001, "client_ip":{"country_code":"GB"}}
{event_count: 30001, "client_ip":{"country_code":"US"}}

And you'll see what your query is actually doing.  In SQL, an aggregate
function is applied after the WHERE clause.  The only way to apply a filter
after an aggregate is with the HAVING clause.



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Thu, Jul 30, 2015 at 8:20 AM, Andries Engelbrecht <
aengelbrecht@maprtech.com> wrote:

> Last I checked group by and having clauses do not support column aliases.
>
> —Andries
>
>
> > On Jul 30, 2015, at 7:19 AM, Stefán Baxter <st...@activitystream.com>
> wrote:
> >
> > Hi,
> >
> > That last case works as expected, sorry, this test data does have null
> > values for country_code.
> >
> > That means that I have a working solution but that it would be nice if v1
> > (above) would work.
> >
> > Thank you,
> > -Stefán
> >
> > On Thu, Jul 30, 2015 at 2:15 PM, Stefán Baxter <
> stefan@activitystream.com>
> > wrote:
> >
> >> make that "Count in where or having clause does not work as expected"
> >>
> >> On Thu, Jul 30, 2015 at 2:14 PM, Stefán Baxter <
> stefan@activitystream.com>
> >> wrote:
> >>
> >>> Hi,
> >>>
> >>> I have data that can be reduced to this:
> >>>
> >>>   - {"client_ip":{"country_code":"US"}}
> >>>   - {"client_ip":{"country_code":"US"}}
> >>>   - {"client_ip":{"country_code":"US"}}
> >>>   - {"client_ip":{"country_code":"GB"}}
> >>>   - {"client_ip":{"country_code":"US"}}
> >>>
> >>> This works fine:
> >>>
> >>> select p.client_ip.country_code country_code, count(*) event_count from
> >>> dfs.tmp.`/events` as p group by p.client_ip.country_code;
> >>>
> >>>
> >>> +---------+--------------+
> >>> | country_code  | event_count  |
> >>> +---------+--------------+
> >>> | US      | 21516        |
> >>> | GB      | 323594       |
> >>> +---------+--------------+
> >>>
> >>>
> >>> This simple where clause returns nothing:
> >>>
> >>> select p.client_ip.country_code country_code, count(*) event_count from
> >>> dfs.tmp.`/events` as p where event_count > 30000 group by
> >>> p.client_ip.country_code;
> >>> +---------------+--------------+
> >>> | country_code  | event_count  |
> >>> +---------------+--------------+
> >>> +---------------+--------------+
> >>>
> >>>
> >>> Using a having clause (v1):
> >>>
> >>> select p.client_ip.country_code country_code, count(*) event_count from
> >>> dfs.tmp.`/events` as p group by p.client_ip.country_code having
> event_count
> >>>> 30000;
> >>>
> >>> Error: PARSE ERROR: From line 1, column 169 to line 1, column 179:
> >>> Expression 'event_count' is not being grouped
> >>>
> >>>
> >>> Using a having clause (v2):
> >>>
> >>> select p.client_ip.country_code country_code, count(*) event_count from
> >>> dfs.tmp.`/events` as p group by p.client_ip.country_code having
> count(*) >
> >>> 30000;
> >>> +---------------+--------------+
> >>> | country_code  | event_count  |
> >>> +---------------+--------------+
> >>> | GB            | 323594       |
> >>> | null*          | 566667       |
> >>> +---------------+--------------+
> >>>
> >>>
> >>> * there are no null values in the database and this looks to be the
> >>> "total for the rest"
> >>>
> >>>
> >>> Does anyone know how this can be made to work?
> >>>
> >>> Regards,
> >>> -Stefan
> >>>
> >>
> >>
>
>

Re: Count where or having clause does not work as expected

Posted by Andries Engelbrecht <ae...@maprtech.com>.
Last I checked group by and having clauses do not support column aliases.

—Andries


> On Jul 30, 2015, at 7:19 AM, Stefán Baxter <st...@activitystream.com> wrote:
> 
> Hi,
> 
> That last case works as expected, sorry, this test data does have null
> values for country_code.
> 
> That means that I have a working solution but that it would be nice if v1
> (above) would work.
> 
> Thank you,
> -Stefán
> 
> On Thu, Jul 30, 2015 at 2:15 PM, Stefán Baxter <st...@activitystream.com>
> wrote:
> 
>> make that "Count in where or having clause does not work as expected"
>> 
>> On Thu, Jul 30, 2015 at 2:14 PM, Stefán Baxter <st...@activitystream.com>
>> wrote:
>> 
>>> Hi,
>>> 
>>> I have data that can be reduced to this:
>>> 
>>>   - {"client_ip":{"country_code":"US"}}
>>>   - {"client_ip":{"country_code":"US"}}
>>>   - {"client_ip":{"country_code":"US"}}
>>>   - {"client_ip":{"country_code":"GB"}}
>>>   - {"client_ip":{"country_code":"US"}}
>>> 
>>> This works fine:
>>> 
>>> select p.client_ip.country_code country_code, count(*) event_count from
>>> dfs.tmp.`/events` as p group by p.client_ip.country_code;
>>> 
>>> 
>>> +---------+--------------+
>>> | country_code  | event_count  |
>>> +---------+--------------+
>>> | US      | 21516        |
>>> | GB      | 323594       |
>>> +---------+--------------+
>>> 
>>> 
>>> This simple where clause returns nothing:
>>> 
>>> select p.client_ip.country_code country_code, count(*) event_count from
>>> dfs.tmp.`/events` as p where event_count > 30000 group by
>>> p.client_ip.country_code;
>>> +---------------+--------------+
>>> | country_code  | event_count  |
>>> +---------------+--------------+
>>> +---------------+--------------+
>>> 
>>> 
>>> Using a having clause (v1):
>>> 
>>> select p.client_ip.country_code country_code, count(*) event_count from
>>> dfs.tmp.`/events` as p group by p.client_ip.country_code having event_count
>>>> 30000;
>>> 
>>> Error: PARSE ERROR: From line 1, column 169 to line 1, column 179:
>>> Expression 'event_count' is not being grouped
>>> 
>>> 
>>> Using a having clause (v2):
>>> 
>>> select p.client_ip.country_code country_code, count(*) event_count from
>>> dfs.tmp.`/events` as p group by p.client_ip.country_code having count(*) >
>>> 30000;
>>> +---------------+--------------+
>>> | country_code  | event_count  |
>>> +---------------+--------------+
>>> | GB            | 323594       |
>>> | null*          | 566667       |
>>> +---------------+--------------+
>>> 
>>> 
>>> * there are no null values in the database and this looks to be the
>>> "total for the rest"
>>> 
>>> 
>>> Does anyone know how this can be made to work?
>>> 
>>> Regards,
>>> -Stefan
>>> 
>> 
>> 


Re: Count where or having clause does not work as expected

Posted by Stefán Baxter <st...@activitystream.com>.
Hi,

That last case works as expected, sorry, this test data does have null
values for country_code.

That means that I have a working solution but that it would be nice if v1
(above) would work.

Thank you,
 -Stefán

On Thu, Jul 30, 2015 at 2:15 PM, Stefán Baxter <st...@activitystream.com>
wrote:

> make that "Count in where or having clause does not work as expected"
>
> On Thu, Jul 30, 2015 at 2:14 PM, Stefán Baxter <st...@activitystream.com>
> wrote:
>
>> Hi,
>>
>> I have data that can be reduced to this:
>>
>>    - {"client_ip":{"country_code":"US"}}
>>    - {"client_ip":{"country_code":"US"}}
>>    - {"client_ip":{"country_code":"US"}}
>>    - {"client_ip":{"country_code":"GB"}}
>>    - {"client_ip":{"country_code":"US"}}
>>
>> This works fine:
>>
>> select p.client_ip.country_code country_code, count(*) event_count from
>> dfs.tmp.`/events` as p group by p.client_ip.country_code;
>>
>>
>> +---------+--------------+
>> | country_code  | event_count  |
>> +---------+--------------+
>> | US      | 21516        |
>> | GB      | 323594       |
>> +---------+--------------+
>>
>>
>> This simple where clause returns nothing:
>>
>> select p.client_ip.country_code country_code, count(*) event_count from
>> dfs.tmp.`/events` as p where event_count > 30000 group by
>> p.client_ip.country_code;
>> +---------------+--------------+
>> | country_code  | event_count  |
>> +---------------+--------------+
>> +---------------+--------------+
>>
>>
>> Using a having clause (v1):
>>
>> select p.client_ip.country_code country_code, count(*) event_count from
>> dfs.tmp.`/events` as p group by p.client_ip.country_code having event_count
>> > 30000;
>>
>> Error: PARSE ERROR: From line 1, column 169 to line 1, column 179:
>> Expression 'event_count' is not being grouped
>>
>>
>> Using a having clause (v2):
>>
>> select p.client_ip.country_code country_code, count(*) event_count from
>> dfs.tmp.`/events` as p group by p.client_ip.country_code having count(*) >
>> 30000;
>> +---------------+--------------+
>> | country_code  | event_count  |
>> +---------------+--------------+
>> | GB            | 323594       |
>> | null*          | 566667       |
>> +---------------+--------------+
>>
>>
>> * there are no null values in the database and this looks to be the
>> "total for the rest"
>>
>>
>> Does anyone know how this can be made to work?
>>
>> Regards,
>>  -Stefan
>>
>
>

Re: Count where or having clause does not work as expected

Posted by Stefán Baxter <st...@activitystream.com>.
make that "Count in where or having clause does not work as expected"

On Thu, Jul 30, 2015 at 2:14 PM, Stefán Baxter <st...@activitystream.com>
wrote:

> Hi,
>
> I have data that can be reduced to this:
>
>    - {"client_ip":{"country_code":"US"}}
>    - {"client_ip":{"country_code":"US"}}
>    - {"client_ip":{"country_code":"US"}}
>    - {"client_ip":{"country_code":"GB"}}
>    - {"client_ip":{"country_code":"US"}}
>
> This works fine:
>
> select p.client_ip.country_code country_code, count(*) event_count from
> dfs.tmp.`/events` as p group by p.client_ip.country_code;
>
>
> +---------+--------------+
> | country_code  | event_count  |
> +---------+--------------+
> | US      | 21516        |
> | GB      | 323594       |
> +---------+--------------+
>
>
> This simple where clause returns nothing:
>
> select p.client_ip.country_code country_code, count(*) event_count from
> dfs.tmp.`/events` as p where event_count > 30000 group by
> p.client_ip.country_code;
> +---------------+--------------+
> | country_code  | event_count  |
> +---------------+--------------+
> +---------------+--------------+
>
>
> Using a having clause (v1):
>
> select p.client_ip.country_code country_code, count(*) event_count from
> dfs.tmp.`/events` as p group by p.client_ip.country_code having event_count
> > 30000;
>
> Error: PARSE ERROR: From line 1, column 169 to line 1, column 179:
> Expression 'event_count' is not being grouped
>
>
> Using a having clause (v2):
>
> select p.client_ip.country_code country_code, count(*) event_count from
> dfs.tmp.`/events` as p group by p.client_ip.country_code having count(*) >
> 30000;
> +---------------+--------------+
> | country_code  | event_count  |
> +---------------+--------------+
> | GB            | 323594       |
> | null*          | 566667       |
> +---------------+--------------+
>
>
> * there are no null values in the database and this looks to be the "total
> for the rest"
>
>
> Does anyone know how this can be made to work?
>
> Regards,
>  -Stefan
>