Posted to user@hive.apache.org by Vadim Dedkov <de...@gmail.com> on 2016/06/16 20:50:26 UTC

Hive indexes without improvement of performance

Hello!

I use Hive 1.1.0-cdh5.5.0 and am trying to use index support.

My index creation:
*CREATE INDEX doc_id_idx on TABLE my_schema_name.doc_t (id) AS 'COMPACT'
WITH DEFERRED REBUILD;*
*ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;*

Then I set configs:
*set hive.optimize.autoindex=true;*
*set hive.optimize.index.filter=true;*
*set hive.optimize.index.filter.compact.minsize=0;*
*set hive.index.compact.query.max.size=-1;*
*set hive.index.compact.query.max.entries=-1; *

And my query is:
*select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*

Sometimes I see a performance improvement, but in most cases I do not.

In cases when I see an improvement:
1. my query
*select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
gives me a NullPointerException (in the logs I see that Hive does not find
my index table)
2. then I write:
*USE my_schema_name;*
*select count(*) from doc_t WHERE id = '3723445235879';*
and get the result with an improvement
(172 sec)

In cases when I do not see an improvement, I can run either
*select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
without an exception, or
*USE my_schema_name;*
*select count(*) from doc_t WHERE id = '3723445235879';*
and get the result
(1153 sec)

My table has about 6 billion rows.
I tried various combinations of the index configs, including only these two:
*set hive.optimize.index.filter=true;*
*set hive.optimize.index.filter.compact.minsize=0;*
My Hadoop version is 2.6.0-cdh5.5.0.

What am I doing wrong?

Thank you.

-- 
_______________             _______________
Best regards,                    С уважением
Vadim Dedkov.                  Вадим Дедков.

Re: Hive indexes without improvement of performance

Posted by Mich Talebzadeh <mi...@gmail.com>.
OK, use EXPLAIN EXTENDED on your SQL query to see whether the optimizer
makes a good decision.

Help the optimizer by updating statistics at the column level:

ANALYZE TABLE <TABLE_NAME> COMPUTE STATISTICS FOR COLUMNS

Use DESC FORMATTED <TABLE_NAME> to see the stats.
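Applied to the table from the original post, that sequence might look like
the sketch below (table and column names are carried over from the question;
whether the plan ends up referencing the index depends on the cluster):

```sql
-- Gather column-level statistics so the optimizer has cardinality data
ANALYZE TABLE my_schema_name.doc_t COMPUTE STATISTICS FOR COLUMNS;

-- Inspect the stored table and column statistics
DESC FORMATTED my_schema_name.doc_t;

-- Check whether the chosen plan actually makes use of the index
EXPLAIN EXTENDED
select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';
```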

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




Re: Hive indexes without improvement of performance

Posted by Vadim Dedkov <de...@gmail.com>.
*without any improvement of performance

Re: Hive indexes without improvement of performance

Posted by Vadim Dedkov <de...@gmail.com>.
Ok, thank you. I tried Hive with Tez for my index problem, without any
performance improvement.

Re: Hive indexes without improvement of performance

Posted by Mich Talebzadeh <mi...@gmail.com>.
Well I guess I have to agree to differ on this with Jorn as before.

Vadim,

Please go ahead and try what Jorn suggests. Report back if you see any
improvement.

Couple of points if I may:

Using Hive on Tez is not going to improve the optimizer's performance. That
is just the execution engine, and BTW I would rather use Hive on Spark. Both
Tez and Spark will be a better fit than the usual map-reduce engine.

Actually my suggestion would be to use Hive as the storage layer only and
use Spark as the query tool. In that case you don't need to worry about
indexes etc. in Hive. Spark with DAG and in-memory computing will do a much
better job.

So


   1. Use Hive with its metadata to store data on HDFS
   2. Use Spark SQL to query that Data. Orders of magnitude faster.


However, I am all for you trying what Jorn suggested.

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com




Re: Hive indexes without improvement of performance

Posted by Vadim Dedkov <de...@gmail.com>.
To explain: I can get the result for count(*) with the help of the index
table, but I cannot see how to get the result for a select * with the help
of the index table.

Re: Hive indexes without improvement of performance

Posted by Vadim Dedkov <de...@gmail.com>.
>> If the optimizer does not pick up the index then you can query the index
>> directly

Could you explain how I can do this for a query like

*select * from my_schema_name.doc_t WHERE id = '3723445235879';*

?
Thank you.
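For reference, a rebuilt COMPACT index is materialized as an ordinary Hive
table, so it can be queried by hand. The sketch below assumes the default
generated index-table name (`my_schema_name__doc_t_doc_id_idx__`) and the
standard compact-index columns (the indexed column plus `_bucketname` and
`_offsets`); check SHOW TABLES for the actual name on your cluster:

```sql
-- Find the generated index table (its name encodes db, table, and index)
SHOW TABLES IN my_schema_name LIKE '*doc_id_idx*';

-- Look up which HDFS files and block offsets hold the matching rows
select `_bucketname`, `_offsets`
from my_schema_name.my_schema_name__doc_t_doc_id_idx__
WHERE id = '3723445235879';
```

This only narrows a lookup down to file/offset pairs; it does not by itself
return the full rows for a select *.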
17 июня 2016 г. 0:03 пользователь "Jörn Franke" <jo...@gmail.com>
написал:

> The indexes are based on HDFS blocksize, which is usually around 128 mb.
> This means for hitting a single row you must always load the full block. In
> traditional databases this blocksize it is much faster. If the optimizer
> does not pick up the index then you can query the index directly (it is
> just a table!). Keep in mind that you should use for the index also an
> adequate storage format, such as Orc or parquet.
>
> You should not use the traditional indexes, but use Hive+Tez and the Orc
> format with storage indexes and bloom filters (i.e. Min Hive 1.2). It is of
> key importance that you insert the data sorted on the columns that you use
> in the where clause. You should compress the table with snappy.
> Additionally partitions make sense. Finally please use the right data types
> . Storage indexes work best with ints etc. for text fields you can try
> bloom filters.
>
> That being said, also in other relational databases such as Oracle
> Exadata, the use of traditional indexes is discouraged for warehouse
> scenarios, but storage indexes and columnar formats including compression
> will bring the most performance.
>
> On 16 Jun 2016, at 22:50, Vadim Dedkov <de...@gmail.com> wrote:
>
> Hello!
>
> I use Hive 1.1.0-cdh5.5.0 and try to use indexes support.
>
> My index creation:
> *CREATE INDEX doc_id_idx on TABLE my_schema_name.doc_t (id) AS 'COMPACT'
> WITH DEFERRED REBUILD;*
> *ALTER INDEX doc_id_idx ON my_schema_name.doc_t REBUILD;*
>
> Then I set configs:
> *set hive.optimize.autoindex=true;*
> *set hive.optimize.index.filter=true;*
> *set hive.optimize.index.filter.compact.minsize=0;*
> *set hive.index.compact.query.max.size=-1;*
> *set hive.index.compact.query.max.entries=-1; *
>
> And my query is:
> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
>
> Sometimes I have improvement of performance, but most of cases - not.
>
> In cases when I have improvement:
> 1. my query is
> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
> give me NullPointerException (in logs I see that Hive doesn't find my
> index table)
> 2. then I write:
> *USE my_schema_name;*
> *select count(*) from doc_t WHERE id = '3723445235879';*
> and have result with improvement
> (172 sec)
>
> In case when I don't have improvement, I can use either
> *select count(*) from my_schema_name.doc_t WHERE id = '3723445235879';*
> without exception, either
> *USE my_schema_name;*
> *select count(*) from doc_t WHERE id = '3723445235879';*
> and have result
> (1153 sec)
>
> My table is about 6 billion rows.
> I tried various combinations on index configs, including only these two:
> *set hive.optimize.index.filter=true;*
> *set hive.optimize.index.filter.compact.minsize=0;*
> My hadoop version is 2.6.0-cdh5.5.0
>
> What I do wrong?
>
> Thank you.
>
> --
> _______________             _______________
> Best regards,                    С уважением
> Vadim Dedkov.                  Вадим Дедков.
>
>

Re: Hive indexes without improvement of performance

Posted by Jörn Franke <jo...@gmail.com>.
The indexes are based on the HDFS block size, which is usually around 128
MB. This means that to hit a single row you must always load the full block.
In traditional databases the block size is much smaller, so this is much
faster. If the optimizer does not pick up the index then you can query the
index directly (it is just a table!). Keep in mind that you should also use
an adequate storage format for the index, such as ORC or Parquet.

You should not use the traditional indexes, but use Hive+Tez and the ORC
format with storage indexes and bloom filters (i.e. Hive 1.2 minimum). It is
of key importance that you insert the data sorted on the columns that you
use in the where clause. You should compress the table with Snappy.
Additionally, partitions make sense. Finally, please use the right data
types: storage indexes work best with ints etc.; for text fields you can try
bloom filters.

That being said, in other relational databases as well, such as Oracle
Exadata, the use of traditional indexes is discouraged for warehouse
scenarios; storage indexes and columnar formats, including compression, will
bring the most performance.
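A minimal sketch of this setup, with the table and column names carried over
from the original post (the `doc_t_orc` name and the single listed column
are illustrative; the TBLPROPERTIES keys are the standard ORC ones):

```sql
-- ORC table with Snappy compression and a bloom filter on the lookup column
CREATE TABLE my_schema_name.doc_t_orc (
  id STRING  -- plus the remaining columns of doc_t
)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress' = 'SNAPPY',
  'orc.bloom.filter.columns' = 'id'
);

-- Insert sorted on the filter column so the ORC min/max storage indexes
-- stay selective and whole stripes can be skipped at read time
INSERT OVERWRITE TABLE my_schema_name.doc_t_orc
SELECT * FROM my_schema_name.doc_t SORT BY id;
```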


Re: Hive indexes without improvement of performance

Posted by Mich Talebzadeh <mi...@gmail.com>.
Nothing.

Hive does not support external indexes, even in version 2.

In other words, although you can create indexes, they are not visible to the
Hive optimizer, as you have found out.

I wrote an article on this, hoping that external indexes would eventually be
put to use.

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


