You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Amey Barve <am...@gmail.com> on 2016/03/22 08:14:08 UTC

select count(*) from table;

select count(*) from table;

How does hive evaluate count(*) on a table?

Does it return count by actually querying table, or directly return count
by consulting some statistics locally.

For Hive's Text format it takes few seconds while Hive's Orc format takes
fraction of seconds.

Regards,
Amey

Re: select count(*) from table;

Posted by Nitin Pawar <ni...@gmail.com>.
If you have enabled performance optimization by enabling statistics it will
come from there
if the underlying file format supports infile statistics (like ORC), it
will come from there
if its just plain vanilla text file format, it needs to run a job to get
the count so the longest of all

On Tue, Mar 22, 2016 at 12:44 PM, Amey Barve <am...@gmail.com> wrote:

> select count(*) from table;
>
> How does hive evaluate count(*) on a table?
>
> Does it return count by actually querying table, or directly return count
> by consulting some statistics locally.
>
> For Hive's Text format it takes few seconds while Hive's Orc format takes
> fraction of seconds.
>
> Regards,
> Amey
>



-- 
Nitin Pawar

Re: select count(*) from table;

Posted by Nitin Pawar <ni...@gmail.com>.
If you have enabled performance optimization by enabling statistics it will
come from there
if the underlying file format supports infile statistics (like ORC), it
will come from there
if its just plain vanilla text file format, it needs to run a job to get
the count so the longest of all

On Tue, Mar 22, 2016 at 12:44 PM, Amey Barve <am...@gmail.com> wrote:

> select count(*) from table;
>
> How does hive evaluate count(*) on a table?
>
> Does it return count by actually querying table, or directly return count
> by consulting some statistics locally.
>
> For Hive's Text format it takes few seconds while Hive's Orc format takes
> fraction of seconds.
>
> Regards,
> Amey
>



-- 
Nitin Pawar

Re: select count(*) from table;

Posted by Amey Barve <am...@gmail.com>.
Hi All,

Can custom storage handlers get information for queries like count, max,
min etc. from hive directly so that for each of such queries RecordReader
need not fetch all the records?

Regards,
Amey

On Tue, Mar 22, 2016 at 1:46 PM, Amey Barve <am...@gmail.com> wrote:

> Thanks Nitin, Mich,
>
> if its just plain vanilla text file format, it needs to run a job to get
> the count so the longest of all
> --> Hive must be translating some operator like fetch (for count) into a
> map-reduce job and getting the result?
> Can a custom storage handler get information about the operator/s for
> count(*) and then use it to retrieve the results.
>
> I want to know whether custom storage handler can get information about
> operators that hive constructs for queries like count, max, min etc. so
> that storage handler can map these to internal storage functions?
>
> Regards,
> Amey
>
> On Tue, Mar 22, 2016 at 1:32 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> ORC file has the following stats levels for storage indexes
>>
>>
>>    1. ORC File itself
>>    2. Multiple stripes (chunks) within the ORC file
>>    3. Multiple row groups (row batches) within each stripe
>>
>> Assuming that the underlying table has stats updated, count will be
>> stored for each column
>>
>> So when we do something like below:
>>
>> select count(1) from orctest
>>
>> you can see stats collected if you do
>>
>> show create table orctest;
>>
>>  TBLPROPERTIES (                                              |
>> |   'COLUMN_STATS_ACCURATE'='true',                            |
>> |   'numFiles'='31',                                           |
>> |   *'numRows'='250000'*,                                        |
>>
>>
>> File statistics, Stripe statistics and row group statistics are kept. So
>> ORC table will rely on those if needed
>>
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 22 March 2016 at 07:14, Amey Barve <am...@gmail.com> wrote:
>>
>>> select count(*) from table;
>>>
>>> How does hive evaluate count(*) on a table?
>>>
>>> Does it return count by actually querying table, or directly return
>>> count by consulting some statistics locally.
>>>
>>> For Hive's Text format it takes few seconds while Hive's Orc format
>>> takes fraction of seconds.
>>>
>>> Regards,
>>> Amey
>>>
>>
>>
>

Re: select count(*) from table;

Posted by Amey Barve <am...@gmail.com>.
Hi All,

Can custom storage handlers get information for queries like count, max,
min etc. from hive directly so that for each of such queries RecordReader
need not fetch all the records?

Regards,
Amey

On Tue, Mar 22, 2016 at 1:46 PM, Amey Barve <am...@gmail.com> wrote:

> Thanks Nitin, Mich,
>
> if its just plain vanilla text file format, it needs to run a job to get
> the count so the longest of all
> --> Hive must be translating some operator like fetch (for count) into a
> map-reduce job and getting the result?
> Can a custom storage handler get information about the operator/s for
> count(*) and then use it to retrieve the results.
>
> I want to know whether custom storage handler can get information about
> operators that hive constructs for queries like count, max, min etc. so
> that storage handler can map these to internal storage functions?
>
> Regards,
> Amey
>
> On Tue, Mar 22, 2016 at 1:32 PM, Mich Talebzadeh <
> mich.talebzadeh@gmail.com> wrote:
>
>> ORC file has the following stats levels for storage indexes
>>
>>
>>    1. ORC File itself
>>    2. Multiple stripes (chunks) within the ORC file
>>    3. Multiple row groups (row batches) within each stripe
>>
>> Assuming that the underlying table has stats updated, count will be
>> stored for each column
>>
>> So when we do something like below:
>>
>> select count(1) from orctest
>>
>> you can see stats collected if you do
>>
>> show create table orctest;
>>
>>  TBLPROPERTIES (                                              |
>> |   'COLUMN_STATS_ACCURATE'='true',                            |
>> |   'numFiles'='31',                                           |
>> |   *'numRows'='250000'*,                                        |
>>
>>
>> File statistics, Stripe statistics and row group statistics are kept. So
>> ORC table will rely on those if needed
>>
>>
>> HTH
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 22 March 2016 at 07:14, Amey Barve <am...@gmail.com> wrote:
>>
>>> select count(*) from table;
>>>
>>> How does hive evaluate count(*) on a table?
>>>
>>> Does it return count by actually querying table, or directly return
>>> count by consulting some statistics locally.
>>>
>>> For Hive's Text format it takes few seconds while Hive's Orc format
>>> takes fraction of seconds.
>>>
>>> Regards,
>>> Amey
>>>
>>
>>
>

Re: select count(*) from table;

Posted by Amey Barve <am...@gmail.com>.
Thanks Nitin, Mich,

if its just plain vanilla text file format, it needs to run a job to get
the count so the longest of all
--> Hive must be translating some operator like fetch (for count) into a
map-reduce job and getting the result?
Can a custom storage handler get information about the operator/s for
count(*) and then use it to retrieve the results.

I want to know whether custom storage handler can get information about
operators that hive constructs for queries like count, max, min etc. so
that storage handler can map these to internal storage functions?

Regards,
Amey

On Tue, Mar 22, 2016 at 1:32 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> ORC file has the following stats levels for storage indexes
>
>
>    1. ORC File itself
>    2. Multiple stripes (chunks) within the ORC file
>    3. Multiple row groups (row batches) within each stripe
>
> Assuming that the underlying table has stats updated, count will be stored
> for each column
>
> So when we do something like below:
>
> select count(1) from orctest
>
> you can see stats collected if you do
>
> show create table orctest;
>
>  TBLPROPERTIES (                                              |
> |   'COLUMN_STATS_ACCURATE'='true',                            |
> |   'numFiles'='31',                                           |
> |   *'numRows'='250000'*,                                        |
>
>
> File statistics, Stripe statistics and row group statistics are kept. So
> ORC table will rely on those if needed
>
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 22 March 2016 at 07:14, Amey Barve <am...@gmail.com> wrote:
>
>> select count(*) from table;
>>
>> How does hive evaluate count(*) on a table?
>>
>> Does it return count by actually querying table, or directly return count
>> by consulting some statistics locally.
>>
>> For Hive's Text format it takes few seconds while Hive's Orc format takes
>> fraction of seconds.
>>
>> Regards,
>> Amey
>>
>
>

Re: select count(*) from table;

Posted by Amey Barve <am...@gmail.com>.
Thanks Nitin, Mich,

if its just plain vanilla text file format, it needs to run a job to get
the count so the longest of all
--> Hive must be translating some operator like fetch (for count) into a
map-reduce job and getting the result?
Can a custom storage handler get information about the operator/s for
count(*) and then use it to retrieve the results.

I want to know whether custom storage handler can get information about
operators that hive constructs for queries like count, max, min etc. so
that storage handler can map these to internal storage functions?

Regards,
Amey

On Tue, Mar 22, 2016 at 1:32 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> ORC file has the following stats levels for storage indexes
>
>
>    1. ORC File itself
>    2. Multiple stripes (chunks) within the ORC file
>    3. Multiple row groups (row batches) within each stripe
>
> Assuming that the underlying table has stats updated, count will be stored
> for each column
>
> So when we do something like below:
>
> select count(1) from orctest
>
> you can see stats collected if you do
>
> show create table orctest;
>
>  TBLPROPERTIES (                                              |
> |   'COLUMN_STATS_ACCURATE'='true',                            |
> |   'numFiles'='31',                                           |
> |   *'numRows'='250000'*,                                        |
>
>
> File statistics, Stripe statistics and row group statistics are kept. So
> ORC table will rely on those if needed
>
>
> HTH
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 22 March 2016 at 07:14, Amey Barve <am...@gmail.com> wrote:
>
>> select count(*) from table;
>>
>> How does hive evaluate count(*) on a table?
>>
>> Does it return count by actually querying table, or directly return count
>> by consulting some statistics locally.
>>
>> For Hive's Text format it takes few seconds while Hive's Orc format takes
>> fraction of seconds.
>>
>> Regards,
>> Amey
>>
>
>

Re: select count(*) from table;

Posted by Mich Talebzadeh <mi...@gmail.com>.
ORC file has the following stats levels for storage indexes


   1. ORC File itself
   2. Multiple stripes (chunks) within the ORC file
   3. Multiple row groups (row batches) within each stripe

Assuming that the underlying table has stats updated, count will be stored
for each column

So when we do something like below:

select count(1) from orctest

you can see stats collected if you do

show create table orctest;

 TBLPROPERTIES (                                              |
|   'COLUMN_STATS_ACCURATE'='true',                            |
|   'numFiles'='31',                                           |
|   *'numRows'='250000'*,                                        |


File statistics, Stripe statistics and row group statistics are kept. So
ORC table will rely on those if needed


HTH




Dr Mich Talebzadeh



LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 22 March 2016 at 07:14, Amey Barve <am...@gmail.com> wrote:

> select count(*) from table;
>
> How does hive evaluate count(*) on a table?
>
> Does it return count by actually querying table, or directly return count
> by consulting some statistics locally.
>
> For Hive's Text format it takes few seconds while Hive's Orc format takes
> fraction of seconds.
>
> Regards,
> Amey
>