Posted to user@hive.apache.org by Ashok Kumar <as...@yahoo.com> on 2016/01/19 16:50:51 UTC

ORC files and statistics

 Hi,
I have read some notes on ORC files in Hive and on indexes.
The document describes the indexes but makes reference to statistics:

Indexes <https://orc.apache.org/docs/indexes.html>
"ORC provides three level of indexes within each file: file level - statistics about the values in each column across the entire file"

I am confused, as it seems to mix up indexes with statistics. Can someone clarify this?
Thanks

Re: ORC files and statistics

Posted by Owen O'Malley <om...@apache.org>.
On Tue, Jan 19, 2016 at 9:45 AM, Ashok Kumar <as...@yahoo.com> wrote:

> Thank you both.
>
> So if I have a Hive table of ORC type and it contains 100K rows, there
> will be 10 row groups of 10K rows each.
>

Yes


>
> Within each row group there will be min, max, count(distinct_value) and sum
> for each column within that row group. Does count mean the count of distinct
> values, including null occurrences, for that column?
>

Actually, it is just count, not count distinct. Newer versions of Hive also
have the option of including bloom filters for some columns. That enables
fast searches for particular values in columns that aren't sorted.
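
To illustrate why a bloom filter helps when a column is not sorted, here is a minimal sketch in Python. It is not ORC's implementation: the class BloomFilter, the helpers build_filters and candidate_groups, the SHA-256 based hashing, and the filter size are all assumptions chosen for the example. The point is only that a filter kept per row group can say "definitely not here" for a value, even when min and max are too wide to rule the group out.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_filters(values, rows_per_group=10_000):
    """Build one filter per row group (hypothetical helper for this sketch)."""
    filters = []
    for start in range(0, len(values), rows_per_group):
        bloom = BloomFilter()
        for value in values[start:start + rows_per_group]:
            bloom.add(value)
        filters.append(bloom)
    return filters


def candidate_groups(filters, value):
    """Row groups that still have to be read when searching for value."""
    return [i for i, bloom in enumerate(filters) if bloom.might_contain(value)]


# Small demo: 300 unsorted-looking keys split into row groups of 100 rows.
keys = [f"user{(i * 7919) % 300}" for i in range(300)]
filters = build_filters(keys, rows_per_group=100)
target = keys[250]  # a key stored in the third row group (index 2)
print(candidate_groups(filters, target))  # always includes 2
```

Because of false positives the list can contain extra row groups, but it never omits a row group that really holds the value.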


>
> also if the table contains 5 columns will there be 5x10 row groups in
> total?
>

The ORC files are laid out in stripes that correspond to roughly 64MB
compressed. Each column within a stripe is laid out together. The row
groups are a feature of the index and correspond to how many entries the
index has. So yes, within a file with 100k rows, which obviously will be a
single stripe, the index will have 10 row groups for each column for a
total of 50 entries in the index. (The index is also laid out in columns so
the reader only loads the parts of the index it needs for the columns it is
reading.)
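
The arithmetic above can be written out as a short Python sketch. The function name index_entries and the parameter name row_index_stride are made up for the example, and it assumes, as in the case discussed here, that all rows fit in a single stripe.

```python
import math


def index_entries(num_rows, num_columns, row_index_stride=10_000):
    """Return (row groups, total index entries) for a single-stripe file.

    Each column gets one index entry per row group, so the total is
    row_groups * num_columns.
    """
    row_groups = math.ceil(num_rows / row_index_stride)
    return row_groups, row_groups * num_columns


row_groups, entries = index_entries(num_rows=100_000, num_columns=5)
print(row_groups, entries)  # 10 50
```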

.. Owen



Re: ORC files and statistics

Posted by Ashok Kumar <as...@yahoo.com>.
Thank you both.
So if I have a Hive table of ORC type and it contains 100K rows, there will be 10 row groups of 10K rows each.
Within each row group there will be min, max, count(distinct_value) and sum for each column within that row group. Does count mean the count of distinct values, including null occurrences, for that column?
Also, if the table contains 5 columns, will there be 5x10 row groups in total?
Thanks again


Re: ORC files and statistics

Posted by Jörn Franke <jo...@gmail.com>.
Just be aware that you should insert the data sorted at least on the most discriminating column of your where clause
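
A rough way to see why sorting matters is to compare the min/max ranges of the row groups for sorted and unsorted data. The Python sketch below is only a simulation with made-up helper names (row_group_ranges, groups_to_read) and synthetic integer data; it does not read ORC files.

```python
import random


def row_group_ranges(values, rows_per_group=10_000):
    """(min, max) for each consecutive block of rows_per_group values."""
    ranges = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        ranges.append((min(chunk), max(chunk)))
    return ranges


def groups_to_read(ranges, low, high):
    """Count row groups whose [min, max] range overlaps the filter [low, high]."""
    return sum(1 for gmin, gmax in ranges if gmax >= low and gmin <= high)


random.seed(0)
shuffled = list(range(100_000))
random.shuffle(shuffled)
ordered = sorted(shuffled)

# Filter equivalent to: WHERE value BETWEEN 42000 AND 42999
print(groups_to_read(row_group_ranges(ordered), 42_000, 42_999))   # 1
print(groups_to_read(row_group_ranges(shuffled), 42_000, 42_999))  # many more
```

With sorted data each row group covers a narrow, non-overlapping range, so only one group overlaps the filter. With shuffled data nearly every row group spans almost the whole value range and cannot be skipped.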


Re: ORC files and statistics

Posted by Owen O'Malley <om...@apache.org>.
It has both. Each index has statistics of min, max, count, and sum for each
column in the row group of 10,000 rows. It also has the location of the
start of each row group, so that the reader can jump straight to the
beginning of the row group. The reader takes a SearchArgument (e.g. age >
100) that limits which rows are required for the query and can avoid
reading an entire file, or at least sections of the file.
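
As a rough illustration of the above, here is a small Python sketch of a row-group index and a reader that uses it for the predicate age > 100. It models the idea only: RowGroupEntry, build_index and scan_greater_than are hypothetical names, the data is an in-memory list without nulls, and start_row stands in for the stored position of the row group.

```python
from dataclasses import dataclass


@dataclass
class RowGroupEntry:
    start_row: int  # where the row group begins, so the reader can jump to it
    minimum: int
    maximum: int
    count: int
    total: int      # the "sum" statistic


def build_index(values, rows_per_group=10_000):
    """Collect min, max, count and sum for every block of rows_per_group rows."""
    entries = []
    for start in range(0, len(values), rows_per_group):
        chunk = values[start:start + rows_per_group]
        entries.append(
            RowGroupEntry(start, min(chunk), max(chunk), len(chunk), sum(chunk))
        )
    return entries


def scan_greater_than(values, index, threshold):
    """Return matching values, reading only row groups that may hold a match."""
    matches, groups_read = [], 0
    for entry in index:
        if entry.maximum <= threshold:
            continue  # no row in this group can satisfy value > threshold
        groups_read += 1
        chunk = values[entry.start_row:entry.start_row + entry.count]
        matches.extend(v for v in chunk if v > threshold)
    return matches, groups_read


# 100,000 synthetic ages: only the last 5,000 rows are above 100.
ages = [i % 90 for i in range(95_000)] + [101 + i % 5 for i in range(5_000)]
index = build_index(ages)
matches, groups_read = scan_greater_than(ages, index, 100)
print(len(index), groups_read, len(matches))  # 10 1 5000
```

Nine of the ten row groups have a maximum of 89, so the reader skips them using the statistics alone and jumps directly to the last row group.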

.. Owen
