Posted to user@hive.apache.org by Joseph <wx...@sina.com> on 2016/03/16 11:23:56 UTC

The built-in indexes in ORC files do not work.

Hi all,

I understand that ORC provides three levels of indexes within each file: file level, stripe level, and row level.
The file and stripe level statistics are stored in the file footer, so they are easy to access when deciding whether the rest of the file needs to be read at all.
Row level indexes include both column statistics for each row group and positions for seeking to the start of each row group.

The following is my understanding:
1. The file and stripe level indexes are always generated; we cannot control them.
2. The row level indexes can be configured via "orc.create.index" (whether to create row indexes) and "orc.row.index.stride" (the number of rows between index entries).
3. Each index holds min/max statistics for each column, so sorting the data by the filter column should bring better performance.
4. To use any of the three levels of indexes, we should enable predicate push-down by setting spark.sql.orc.filterPushdown=true (in Spark SQL) or hive.optimize.ppd=true (in Hive); see the session settings sketched below.
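
For reference, a minimal sketch of the session settings (property names as used in Hive 1.2.1 and Spark 1.5.2; hive.optimize.index.filter additionally lets the ORC reader evaluate the pushed-down filter against its indexes):

    -- Hive session:
    SET hive.optimize.ppd=true;
    SET hive.optimize.index.filter=true;

    -- Spark SQL session:
    SET spark.sql.orc.filterPushdown=true;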

But I found that the built-in indexes in ORC files did not work in either Spark 1.5.2 or Hive 1.2.1:
First, when the query's WHERE clause did not match any record (the filter column was compared against a value beyond the range of the data), performance with predicate push-down enabled was almost the same as with it disabled. I would expect that when the filter value is beyond the range of the data, none of the ORC files would be scanned at all if the file level indexes were used, so performance should improve markedly.

Second, when "orc.create.index" was enabled, the data was sorted by the filter column, and the WHERE clause matched only a few records, performance with predicate push-down enabled was again almost the same as with it disabled.

Third, with predicate push-down and "orc.create.index" both enabled, performance when the filter value was beyond the range of the data was almost the same as when the filter value covered almost the whole data set.

So, has anyone used ORC's built-in indexes before (especially in Spark SQL)? What is my issue?

Thanks!



Joseph

Re: The built-in indexes in ORC files do not work.

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

The parameters that control the stripe and row group sizes are configurable via the
ORC table creation script:

CREATE TABLE dummy (
     ID INT
   , CLUSTERED INT
   , SCATTERED INT
   , RANDOMISED INT
   , RANDOM_STRING VARCHAR(50)
   , SMALL_VC VARCHAR(10)
   , PADDING  VARCHAR(10)
)
CLUSTERED BY (ID) INTO 256 BUCKETS
STORED AS ORC
TBLPROPERTIES (
"orc.create.index"="true",
"orc.bloom.filter.columns"="ID",
"orc.bloom.filter.fpp"="0.05",
"orc.compress"="SNAPPY",
"orc.stripe.size"="16777216",
"orc.row.index.stride"="10000" )
;
So here I make my stripe quite small, 16 MB (as opposed to the default of
64 MB), and set orc.row.index.stride = 10000.

You can inspect the available row group statistics for the various
columns (0, 1, 2, 3, ...) with something like the following:

hive --orcfiledump --rowindex 0,1,2,3,4,5,6
/user/hive/warehouse/test.db/dummy/000000_0

In reality I have found that the only occasion the stats are used
is when you actually bucket the ORC table or use partitioning. There are
also dependencies on the block size and on how many rows are in each
block. If the whole table fits in one block, I believe the stats are
ignored (at least this was the case in older versions of Hive; I use Hive 2).

Check the optimiser plan with:

explain extended   <YOUR_CODE>
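
For instance, against the dummy table above (the literal value is illustrative):

explain extended select count(*) from dummy where id = 42;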

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




Re: The built-in indexes in ORC files do not work.

Posted by Mich Talebzadeh <mi...@gmail.com>.
I did some tests on Hive running on MR, to rule out Spark effects.

In an ORC table that has been partitioned, partition elimination with
predicate push-down works, and the query is narrowed to the partition
itself. I can see that from the number of rows within that partition, as in
the sketch and plan below.
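
A minimal sketch of the kind of table assumed here (the measure columns are
illustrative; only the ORC storage and the year/month partitioning matter):

create table sales (
    prod_id     int,
    cust_id     int,
    amount_sold double
)
partitioned by (year int, month int)
stored as orc;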

For example, this sales table is partitioned by year and month; for
year = 1999 and month = 8 there are 124,284 rows:

explain extended select count(1) from sales where year = 1999 and month = 8;
        Map 1
            Map Operator Tree:
                TableScan
                  alias: sales
                  Statistics: Num rows: 124284 Data size: 1184497 Basic stats: COMPLETE Column stats: NONE
                  GatherStats: false
                  Select Operator
                    Statistics: Num rows: 124284 Data size: 1184497 Basic stats: COMPLETE Column stats: NONE
                    Group By Operator

That is the only time I have seen, through the explain plan, that partition
elimination is working.


HTH






Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




RE: The built-in indexes in ORC files do not work.

Posted by "Wietsma, Tristan A." <Tr...@capitalone.com>.
Regarding bloom filters, https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-12417




Re: Re: The built-in indexes in ORC files do not work.

Posted by Joseph <wx...@sina.com>.
terminal_type = 0: 260,000,000 rows, covering almost half of the whole data.
terminal_type = 25066: just 3,800 rows.

orc tblproperties ("orc.compress"="SNAPPY", "orc.compress.size"="262141", "orc.stripe.size"="268435456", "orc.row.index.stride"="100000", "orc.create.index"="true", "orc.bloom.filter.columns"="");

The table "gprs" has been sorted by terminal_type. Before sorting, I had another table named "gprs_orc", and I used Spark SQL to sort the data as follows (before doing this, I set hive.enforce.sorting=true):

sql> INSERT INTO TABLE gprs SELECT * FROM gprs_orc sort by terminal_type;

Because the table gprs_orc has 800 files, this generated 800 tasks and also created 800 files in table gprs. But I am not sure whether each file is sorted separately or not.
I have tried a bloom filter, but it made no improvement. I know about Tez but have never used it; I will try it later.

The following is my test in Hive 1.2.1:

1. With hive.optimize.index.filter and hive.optimize.ppd enabled:
    select count(*) from gprs where terminal_type=25080;    does not scan data            Time taken: 353.345 seconds
    select count(*) from gprs where terminal_type=25066;    scans just a few row groups   Time taken: 354.860 seconds
    select count(*) from gprs where terminal_type=0;        scans half of the data        Time taken: 378.312 seconds

2. With hive.optimize.index.filter and hive.optimize.ppd disabled:
    select count(*) from gprs where terminal_type=25080;    scans all the data            Time taken: 389.700 seconds
    select count(*) from gprs where terminal_type=25066;    scans all the data            Time taken: 386.600 seconds
    select count(*) from gprs where terminal_type=0;        scans all the data            Time taken: 395.968 seconds

My environment:
    3 nodes, 12 CPU cores per node, 48 GB free memory per node, 4 disks per node, 3 replicas per block, Hadoop 2.7.2, Hive 1.2.1




Joseph
 

Re: The built-in indexes in ORC files do not work.

Posted by Jörn Franke <jo...@gmail.com>.
Not sure it should work. How many rows are affected? Is the data sorted?
Have you tried Tez? Tez has some summary statistics that tell you whether push-down is used. Maybe you need to use HiveContext.
Perhaps a bloom filter could make sense for you as well; a rough sketch follows.
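
For instance (an unverified sketch; the table and column names are taken from your mails, and the new table name gprs_bf is hypothetical):

CREATE TABLE gprs_bf
STORED AS ORC
TBLPROPERTIES (
  "orc.create.index"="true",
  "orc.bloom.filter.columns"="terminal_type",
  "orc.bloom.filter.fpp"="0.05")
AS SELECT * FROM gprs;

-- An equality predicate can then consult the bloom filter per row group:
-- SELECT COUNT(*) FROM gprs_bf WHERE terminal_type = 25066;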

> On 16 Mar 2016, at 12:45, Joseph <wx...@sina.com> wrote:
> 
> Hi,
> 
> I have only one table, named "gprs"; it has 560,000,000 rows and 57 columns. The block size is 256 MB; there are 800 ORC files in total, each about 51 MB.
> 
> My query statements are:
> select count(*) from gprs where terminal_type = 25080;
> select * from gprs where terminal_type = 25080;
> 
> In the gprs table, the "terminal_type" column's values lie in [0, 25066].
> 
> Joseph

Re: The built-in indexes in ORC files do not work.

Posted by Jörn Franke <jo...@gmail.com>.
How much data are you querying? What is the query? How selective is it supposed to be? What is the block size?
