Posted to user@orc.apache.org by Thomas Abeler <th...@sensenetworks.com> on 2015/07/16 18:07:48 UTC

ORC Indexing

Hey,



I have a question about how indexing in ORC works.



As I understand it, ORC keeps statistics (min, max, sum) for every 10,000
rows (by default), and when I query the data it looks at those statistics
to figure out whether it needs to read that chunk of rows or not.
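
To make the skipping logic above concrete, here is a minimal Python sketch
of min/max pruning. It is not ORC's reader code: RowGroupStats, build_stats
and groups_to_read are hypothetical names, and the column is just an
in-memory list of integers.

```python
from dataclasses import dataclass

STRIDE = 10_000  # assumed rows per chunk, matching the default mentioned above


@dataclass
class RowGroupStats:
    start_row: int  # first row covered by this chunk
    minimum: int
    maximum: int


def build_stats(values, stride=STRIDE):
    """Record min/max for each consecutive block of `stride` values."""
    stats = []
    for start in range(0, len(values), stride):
        chunk = values[start:start + stride]
        stats.append(RowGroupStats(start, min(chunk), max(chunk)))
    return stats


def groups_to_read(stats, target):
    """Start rows of the chunks whose [min, max] range could hold `target`."""
    return [s.start_row for s in stats if s.minimum <= target <= s.maximum]


# Sorted column: the ranges do not overlap, so only one chunk qualifies.
sorted_col = list(range(30_000))
print(groups_to_read(build_stats(sorted_col), 12_345))  # → [10000]

# Unsorted column of even numbers 0..98: every chunk spans the full range,
# so an absent value such as 43 still forces all three chunks to be read.
scattered_col = [(i % 50) * 2 for i in range(30_000)]
print(groups_to_read(build_stats(scattered_col), 43))  # → [0, 10000, 20000]
```

The second case shows why statistics alone help little on columns that are
not sorted or clustered.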



If that's true - is it possible to build an index on an ORC file that is
more similar to a database index? By that I mean another sorted data
structure that holds the field value and a pointer to the record it
relates to.



The problem is that I have a huge dataset: more than 300 TB and 69 columns.
There is no 'key' column that gets queried frequently, and I would like to
run ad-hoc queries on nearly every one of these columns. I think building
an index on every column would be a good approach to get this ability.



Regards,

Thomas

Re: ORC Indexing

Posted by Owen O'Malley <om...@apache.org>.

As Prasanth pointed out, the bloom filters provide a low-level index. They
are great for determining which sets of 10k rows (row groups) your query
needs to read. The size of the row group defaults to 10k rows, but is
settable via orc.row.index.stride.
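
As a rough illustration of how a per-row-group bloom filter narrows down
what to read, here is a small Python sketch. It is an assumption-laden toy,
not ORC's implementation: the BloomFilter class, the salted SHA-256 hashing,
the bit-array size and the helper names are all made up for the example.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter; sizes and hash scheme are illustrative assumptions."""

    def __init__(self, num_bits=1 << 17, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, value):
        # Derive one bit position per hash by salting SHA-256 with a seed byte.
        data = repr(value).encode("utf-8")
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


def build_filters(values, stride=10_000):
    """Build one filter per block of `stride` rows: [(start_row, filter), ...]."""
    filters = []
    for start in range(0, len(values), stride):
        bf = BloomFilter()
        for v in values[start:start + stride]:
            bf.add(v)
        filters.append((start, bf))
    return filters


def candidate_groups(filters, target):
    """Start rows of the row groups that may contain `target`."""
    return [start for start, bf in filters if bf.might_contain(target)]


column = [i * 2 for i in range(30_000)]
filters = build_filters(column)
# Row 12,345 holds 24,690, so the group starting at row 10,000 is always
# among the candidates; other groups appear only as rare false positives.
print(candidate_groups(filters, 24_690))
```

A bloom filter never produces a false negative, so a row group it rules out
can be skipped safely; a false positive only costs an unnecessary read.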

Part of the ORC index is to record the start position of each column at the
row group boundary. So the seek to row in the RecordReader is efficient,
because it jumps to the 10k row offset and skips over the values to get to
the right row.
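
The jump-then-skip idea can be sketched in a few lines of Python. This is a
toy model under stated assumptions, not the ORC file layout: the column is
stored one value per line, and write_column and seek_to_row are hypothetical
helpers that record and use a byte offset per row group.

```python
import io


def write_column(values, stride):
    """Serialize values one per line.

    Returns (data, offsets), where offsets[i] is the byte position at which
    row group i starts.
    """
    buf = io.BytesIO()
    offsets = []
    for row, value in enumerate(values):
        if row % stride == 0:
            offsets.append(buf.tell())  # remember the row group boundary
        buf.write(f"{value}\n".encode("utf-8"))
    return buf.getvalue(), offsets


def seek_to_row(data, offsets, row, stride):
    """Return the value stored at `row` without scanning from the start."""
    group, skip = divmod(row, stride)
    stream = io.BytesIO(data)
    stream.seek(offsets[group])  # jump straight to the row group boundary
    for _ in range(skip):        # then skip the values before the target row
        stream.readline()
    return stream.readline().decode("utf-8").rstrip("\n")


values = [f"v{i}" for i in range(25)]
data, offsets = write_column(values, stride=10)
print(seek_to_row(data, offsets, 17, stride=10))  # → v17
```

With a stride of 10k, a seek therefore never has to skip more than 9,999
values, whatever the size of the file.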

So you absolutely can build an external index yourself and keep the row
numbers within the file, or you could design a high-level index that maps
values to particular files and use the bloom filters to figure out which
row groups within the file you need to read.
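
A minimal Python sketch of both options follows. Everything in it is
hypothetical: ExternalIndex, its methods and the file names are invented
for the example, and the "files" are plain lists standing in for one column
of each ORC file.

```python
from bisect import bisect_left, bisect_right


class ExternalIndex:
    """Sorted (value, file, row) entries, searched with binary search."""

    def __init__(self, files):
        # files: dict mapping file name -> list of column values, one per row
        self.entries = sorted(
            (value, name, row)
            for name, column in files.items()
            for row, value in enumerate(column)
        )
        self.keys = [entry[0] for entry in self.entries]

    def lookup(self, value):
        """Option 1: exact (file, row number) pointers for a value."""
        lo = bisect_left(self.keys, value)
        hi = bisect_right(self.keys, value)
        return [(name, row) for _, name, row in self.entries[lo:hi]]

    def files_for(self, value):
        """Option 2: only the files that contain the value."""
        return sorted({name for name, _ in self.lookup(value)})


index = ExternalIndex({
    "part-0.orc": ["bob", "alice", "carol"],
    "part-1.orc": ["dave", "alice"],
})
print(index.lookup("alice"))     # → [('part-0.orc', 1), ('part-1.orc', 1)]
print(index.files_for("carol"))  # → ['part-0.orc']
print(index.lookup("zed"))       # → []
```

The row-level variant stores one entry per row, so on a very large dataset
the file-level mapping is much smaller and leaves the row group selection
inside each file to the bloom filters.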

.. Owen


Re: ORC Indexing

Posted by Prasanth J <j....@gmail.com>.

A bloom filter index was recently added to ORC. It is much more accurate at row group elimination than the min/max-based index.

Thanks
Prasanth
