You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Peter Marron <Pe...@trilliumsoftware.com> on 2013/06/28 16:54:24 UTC

Override COUNT() function

Hi,

I feel sure that someone has asked for this before, but here goes...

In the case where I have the query

                SELECT COUNT(*) FROM table;

There are many cases where I can determine the count immediately.
(For example if I have run something like:

ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)] COMPUTE STATISTICS [noscan];

then there seems to be a table property "numRows" which holds a count of the number of rows.
Now I know that the COUNT function can't always be determined easily.
If the query is more complicated, like

                SELECT COUNT(*) FROM table GROUP BY column;

then obviously a simple scalar count is of no real use. But is there some way
to intercept the simple case and avoid running a table scan?

One problem that I see is that the COUNT function is a UDAF and I am
assuming that the presence of any aggregate function like this is enough
to force the query planner to require a Map/Reduce. Is there anyway
to make the function look like a simple UDF for some queries? Or
just for some tables? I guess that I'd be prepared to sacrifice the full
generality of the normal COUNT function for one which
only functions correctly for the simple query on my tables.

So is it possible to have a different COUNT function only on certain tables?

Regards,

Z


Re: Override COUNT() function

Posted by Navis류승우 <na...@nexr.com>.
As you expected, there is no documentation on it (like other optimizers)

Javadoc of the class might be helpful but seemed not in detail enough.

2013/7/2 Peter Marron <Pe...@trilliumsoftware.com>:
> Thanks Navis,
>
> This is a very interesting class which I feel pretty sure that I would never have found.
> Are  there any descriptions, motivations, documentation or examples anywhere?
> I suspect that there's nothing other than the source itself, but I had to ask.
>
> Regards,
>
> Z
> -----Original Message-----
> From: Navis류승우 [mailto:navis.ryu@nexr.com]
> Sent: 02 July 2013 08:50
> To: user@hive.apache.org
> Subject: Re: Override COUNT() function
>
> MetadataOnlyOptimizer changes GBY on partition columns to simple TableScan with one line dummy.
>
> I think similar things can be done with stats.
>
> 2013/6/28 Peter Marron <Pe...@trilliumsoftware.com>:
>> Hi,
>>
>>
>>
>> I feel sure that someone has asked for this before, but here goes…
>>
>>
>>
>> In the case where I have the query
>>
>>
>>
>>                 SELECT COUNT(*) FROM table;
>>
>>
>>
>> There are many cases where I can determine the count immediately.
>>
>> (For example if I have run something like:
>>
>>
>>
>> ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2],
>> ...)] COMPUTE STATISTICS [noscan];
>>
>>
>>
>> then there seems to be a table property “numRows” which holds a count
>> of the number of rows.
>>
>> Now I know that the COUNT function can’t always be determined easily.
>>
>> If the query is more complicated, like
>>
>>
>>
>>                 SELECT COUNT(*) FROM table GROUP BY column;
>>
>>
>>
>> then obviously a simple scalar count is of no real use. But is there
>> some way
>>
>> to intercept the simple case and avoid running a table scan?
>>
>>
>>
>> One problem that I see is that the COUNT function is a UDAF and I am
>>
>> assuming that the presence of any aggregate function like this is
>> enough
>>
>> to force the query planner to require a Map/Reduce. Is there anyway
>>
>> to make the function look like a simple UDF for some queries? Or
>>
>> just for some tables? I guess that I’d be prepared to sacrifice the
>> full
>>
>> generality of the normal COUNT function for one which
>>
>> only functions correctly for the simple query on my tables.
>>
>>
>>
>> So is it possible to have a different COUNT function only on certain tables?
>>
>>
>>
>> Regards,
>>
>>
>>
>> Z
>>
>>
>

RE: Override COUNT() function

Posted by Peter Marron <Pe...@trilliumsoftware.com>.
Thanks Navis,

This is a very interesting class which I feel pretty sure that I would never have found.
Are  there any descriptions, motivations, documentation or examples anywhere?
I suspect that there's nothing other than the source itself, but I had to ask.

Regards,

Z
-----Original Message-----
From: Navis류승우 [mailto:navis.ryu@nexr.com] 
Sent: 02 July 2013 08:50
To: user@hive.apache.org
Subject: Re: Override COUNT() function

MetadataOnlyOptimizer changes GBY on partition columns to simple TableScan with one line dummy.

I think similar things can be done with stats.

2013/6/28 Peter Marron <Pe...@trilliumsoftware.com>:
> Hi,
>
>
>
> I feel sure that someone has asked for this before, but here goes…
>
>
>
> In the case where I have the query
>
>
>
>                 SELECT COUNT(*) FROM table;
>
>
>
> There are many cases where I can determine the count immediately.
>
> (For example if I have run something like:
>
>
>
> ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], 
> ...)] COMPUTE STATISTICS [noscan];
>
>
>
> then there seems to be a table property “numRows” which holds a count 
> of the number of rows.
>
> Now I know that the COUNT function can’t always be determined easily.
>
> If the query is more complicated, like
>
>
>
>                 SELECT COUNT(*) FROM table GROUP BY column;
>
>
>
> then obviously a simple scalar count is of no real use. But is there 
> some way
>
> to intercept the simple case and avoid running a table scan?
>
>
>
> One problem that I see is that the COUNT function is a UDAF and I am
>
> assuming that the presence of any aggregate function like this is 
> enough
>
> to force the query planner to require a Map/Reduce. Is there anyway
>
> to make the function look like a simple UDF for some queries? Or
>
> just for some tables? I guess that I’d be prepared to sacrifice the 
> full
>
> generality of the normal COUNT function for one which
>
> only functions correctly for the simple query on my tables.
>
>
>
> So is it possible to have a different COUNT function only on certain tables?
>
>
>
> Regards,
>
>
>
> Z
>
>


Re: Override COUNT() function

Posted by Navis류승우 <na...@nexr.com>.
MetadataOnlyOptimizer changes GBY on partition columns to simple
TableScan with one line dummy.

I think similar things can be done with stats.

2013/6/28 Peter Marron <Pe...@trilliumsoftware.com>:
> Hi,
>
>
>
> I feel sure that someone has asked for this before, but here goes…
>
>
>
> In the case where I have the query
>
>
>
>                 SELECT COUNT(*) FROM table;
>
>
>
> There are many cases where I can determine the count immediately.
>
> (For example if I have run something like:
>
>
>
> ANALYZE TABLE tablename [PARTITION(partcol1[=val1], partcol2[=val2], ...)]
> COMPUTE STATISTICS [noscan];
>
>
>
> then there seems to be a table property “numRows” which holds a count of the
> number of rows.
>
> Now I know that the COUNT function can’t always be determined easily.
>
> If the query is more complicated, like
>
>
>
>                 SELECT COUNT(*) FROM table GROUP BY column;
>
>
>
> then obviously a simple scalar count is of no real use. But is there some
> way
>
> to intercept the simple case and avoid running a table scan?
>
>
>
> One problem that I see is that the COUNT function is a UDAF and I am
>
> assuming that the presence of any aggregate function like this is enough
>
> to force the query planner to require a Map/Reduce. Is there anyway
>
> to make the function look like a simple UDF for some queries? Or
>
> just for some tables? I guess that I’d be prepared to sacrifice the full
>
> generality of the normal COUNT function for one which
>
> only functions correctly for the simple query on my tables.
>
>
>
> So is it possible to have a different COUNT function only on certain tables?
>
>
>
> Regards,
>
>
>
> Z
>
>