You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Mich Talebzadeh <mi...@peridale.co.uk> on 2016/01/06 07:24:21 UTC

Indexes in Hive

Hi,

Thinking loudly.

Ideally we should consider a totally columnar storage offering in which each
column of table is stored as compressed value (I disregard for now how
actually ORC does this but obviously it is not exactly a columnar storage).

So each table can be considered as a loose federation of columnar storage
and each column is effectively an index?

As columns are far narrower than tables, each index block will be very
higher density and all operations like aggregates can be done directly on
index rather than table. 

This type of table offering will be in true nature of data warehouse
storage. Of course row operations (get me all rows for this table) will be
slower but that is the trade-off that we need to consider.

Expecting users to write their own IndexHandler may be technically
interesting but commercially not viable as Hive needs to be a product on its
own merit not a development base. Writing your own storage attributes etc.
requires skills that will put off people seeing Hive as an attractive
proposition (requiring considerable investment in skill sets in order to
maintain Hive).

Thus my thinking on this is to offer true columnar storage in Hive to be a
proper data warehouse. In addition, the development tools cab ne made
available for those interested in tailoring their own specific Hive
solutions.


HTH



Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUr
V8Pw

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.
pdf
Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.


-----Original Message-----
From: Gopal Vijayaraghavan [mailto:gopal@hortonworks.com] On Behalf Of Gopal
Vijayaraghavan
Sent: 05 January 2016 23:55
To: user@hive.apache.org
Subject: Re: Is Hive Index officially not recommended?


>So in a nutshell in Hive if "external" indexes are not used for 
>improving query response, what value they add and can we forget them for
now?

The builtin indexes - those that write data as smaller tables are only
useful in a pre-columnar world, where the indexes offer a huge reduction in
IO.

Part #1 of using hive indexes effectively is to write your own
HiveIndexHandler, with usesIndexTable=false;

And then write a IndexPredicateAnalyzer, which lets you map arbitrary
lookups into other range conditions.

Not coincidentally - we're adding a "ANALYZE TABLE ... CACHE METADATA"
which consolidates the "internal" index into an external store (HBase).

Some of the index data now lives in the HBase metastore, so that the
inclusion/exclusion of whole partitions can be done off the consolidated
index. 

https://issues.apache.org/jira/browse/HIVE-11676


The experience from BI workloads run by customers is that in general, the
lookup to the right "slice" of data is more of a problem than the actual
aggregate.

And that for a workhorse data warehouse, this has to survive even if there's
a non-stop stream of updates into it.

Cheers,
Gopal


RE: Indexes in Hive

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
Thanks guys

 

A typical columnar database stores data by breaking the rows of a table into
individual columns and storing the successive values in an indexed and
compressed form in data blocks. The nth row of the table can be
reconstituted by taking the nth element from each column heap

 

So data is broken into individual columns. Every column is stored as an
index, the type varying based on the native data type and cardinality (the
number of distinct values) of the underlying column. Further, since each
column occupies its own data blocks, those blocks can be compressed, again
based on the data type and index it is stored in. The Row ID (a block number
and offset) threads all of the bits of data that comprise a row together
without having to maintain any physical co-location at all. That is very
important.

 

The above essentially means that data blocks for each column have to be
contiguous. This may be challenging in HDFS because by definition a
distributed file system like HDFS cannot maintain that strict ordering of
blocks. However, can this be achieved without comprising the redundancy? May
be the location of these contiguous blocks can be maintained in NameNode in
some efficient way. If the optimiser becomes aware of this storage ordering
then column operations should be very efficient. Additionally one can create
indexes associated with these columns. It is important to remember these
additional indexes will be optimized for "single columns only" and  in some
cases they do not even need to store the underlying data value. 

 

The drawback would be that queries requiring full row operations will by
definition be inefficient together with update operations. However, I think
if it is achieved it will be a great plus for Hive.

 

Cheers,

 

 

Dr Mich Talebzadeh

 

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUr
V8Pw

 

Sybase ASE 15 Gold Medal Award 2008

A Winning Strategy: Running the most Critical Financial Data on ASE 15

http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.
pdf

Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 

co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4

Publications due shortly:

Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8

Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

 

http://talebzadehmich.wordpress.com <http://talebzadehmich.wordpress.com/> 

 

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.

 

From: Alan Gates [mailto:alanfgates@gmail.com] 
Sent: 06 January 2016 18:19
To: user@hive.apache.org
Subject: Re: Indexes in Hive

 

The issue with this is that HDFS lacks the ability to co-locate blocks.  So
if you break your columns into one file per column (the more traditional
column route) you end up in a situation where 2/3 of the time only one of
your columns is being locally read, which results in a significant
performance penalty.  That's why ORC and Parquet and RCFile all use one file
for their "columnar" stores.

Alan.






 <ma...@peridale.co.uk> Mich Talebzadeh

January 5, 2016 at 22:24

Hi,

Thinking loudly.

Ideally we should consider a totally columnar storage offering in which each
column of table is stored as compressed value (I disregard for now how
actually ORC does this but obviously it is not exactly a columnar storage).

So each table can be considered as a loose federation of columnar storage
and each column is effectively an index?

As columns are far narrower than tables, each index block will be very
higher density and all operations like aggregates can be done directly on
index rather than table. 

This type of table offering will be in true nature of data warehouse
storage. Of course row operations (get me all rows for this table) will be
slower but that is the trade-off that we need to consider.

Expecting users to write their own IndexHandler may be technically
interesting but commercially not viable as Hive needs to be a product on its
own merit not a development base. Writing your own storage attributes etc.
requires skills that will put off people seeing Hive as an attractive
proposition (requiring considerable investment in skill sets in order to
maintain Hive).

Thus my thinking on this is to offer true columnar storage in Hive to be a
proper data warehouse. In addition, the development tools cab ne made
available for those interested in tailoring their own specific Hive
solutions.


HTH



Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUr
V8Pw

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.
pdf
Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.


-----Original Message-----
From: Gopal Vijayaraghavan [mailto:gopal@hortonworks.com] On Behalf Of Gopal
Vijayaraghavan
Sent: 05 January 2016 23:55
To: user@hive.apache.org <ma...@hive.apache.org> 
Subject: Re: Is Hive Index officially not recommended?



now?

The builtin indexes - those that write data as smaller tables are only
useful in a pre-columnar world, where the indexes offer a huge reduction in
IO.

Part #1 of using hive indexes effectively is to write your own
HiveIndexHandler, with usesIndexTable=false;

And then write a IndexPredicateAnalyzer, which lets you map arbitrary
lookups into other range conditions.

Not coincidentally - we're adding a "ANALYZE TABLE ... CACHE METADATA"
which consolidates the "internal" index into an external store (HBase).

Some of the index data now lives in the HBase metastore, so that the
inclusion/exclusion of whole partitions can be done off the consolidated
index. 

https://issues.apache.org/jira/browse/HIVE-11676


The experience from BI workloads run by customers is that in general, the
lookup to the right "slice" of data is more of a problem than the actual
aggregate.

And that for a workhorse data warehouse, this has to survive even if there's
a non-stop stream of updates into it.

Cheers,
Gopal


Re: Indexes in Hive

Posted by Alan Gates <al...@gmail.com>.
The issue with this is that HDFS lacks the ability to co-locate blocks.  
So if you break your columns into one file per column (the more 
traditional column route) you end up in a situation where 2/3 of the 
time only one of your columns is being locally read, which results in a 
significant performance penalty.  That's why ORC and Parquet and RCFile 
all use one file for their "columnar" stores.

Alan.

> Mich Talebzadeh <ma...@peridale.co.uk>
> January 5, 2016 at 22:24
> Hi,
>
> Thinking loudly.
>
> Ideally we should consider a totally columnar storage offering in 
> which each
> column of table is stored as compressed value (I disregard for now how
> actually ORC does this but obviously it is not exactly a columnar 
> storage).
>
> So each table can be considered as a loose federation of columnar storage
> and each column is effectively an index?
>
> As columns are far narrower than tables, each index block will be very
> higher density and all operations like aggregates can be done directly on
> index rather than table.
>
> This type of table offering will be in true nature of data warehouse
> storage. Of course row operations (get me all rows for this table) will be
> slower but that is the trade-off that we need to consider.
>
> Expecting users to write their own IndexHandler may be technically
> interesting but commercially not viable as Hive needs to be a product 
> on its
> own merit not a development base. Writing your own storage attributes etc.
> requires skills that will put off people seeing Hive as an attractive
> proposition (requiring considerable investment in skill sets in order to
> maintain Hive).
>
> Thus my thinking on this is to offer true columnar storage in Hive to be a
> proper data warehouse. In addition, the development tools cab ne made
> available for those interested in tailoring their own specific Hive
> solutions.
>
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUr
> V8Pw
>
> Sybase ASE 15 Gold Medal Award 2008
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.
> pdf
> Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 
> 15",
> ISBN 978-0-9563693-0-7.
> co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4
> Publications due shortly:
> Complex Event Processing in Heterogeneous Environments, ISBN:
> 978-0-9563693-3-8
> Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale 
> Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. 
> It is
> the responsibility of the recipient to ensure that this email is virus 
> free,
> therefore neither Peridale Ltd, its subsidiaries nor their employees 
> accept
> any responsibility.
>
>
> -----Original Message-----
> From: Gopal Vijayaraghavan [mailto:gopal@hortonworks.com] On Behalf Of 
> Gopal
> Vijayaraghavan
> Sent: 05 January 2016 23:55
> To: user@hive.apache.org
> Subject: Re: Is Hive Index officially not recommended?
>
>
> now?
>
> The builtin indexes - those that write data as smaller tables are only
> useful in a pre-columnar world, where the indexes offer a huge 
> reduction in
> IO.
>
> Part #1 of using hive indexes effectively is to write your own
> HiveIndexHandler, with usesIndexTable=false;
>
> And then write a IndexPredicateAnalyzer, which lets you map arbitrary
> lookups into other range conditions.
>
> Not coincidentally - we're adding a "ANALYZE TABLE ... CACHE METADATA"
> which consolidates the "internal" index into an external store (HBase).
>
> Some of the index data now lives in the HBase metastore, so that the
> inclusion/exclusion of whole partitions can be done off the consolidated
> index.
>
> https://issues.apache.org/jira/browse/HIVE-11676
>
>
> The experience from BI workloads run by customers is that in general, the
> lookup to the right "slice" of data is more of a problem than the actual
> aggregate.
>
> And that for a workhorse data warehouse, this has to survive even if 
> there's
> a non-stop stream of updates into it.
>
> Cheers,
> Gopal
>

RE: Indexes in Hive

Posted by Mich Talebzadeh <mi...@peridale.co.uk>.
I believe so Jorn. 

I am not sure how much it differs from ORC file storage?

Cheers,

Dr Mich Talebzadeh

LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUr
V8Pw

Sybase ASE 15 Gold Medal Award 2008
A Winning Strategy: Running the most Critical Financial Data on ASE 15
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.
pdf
Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15",
ISBN 978-0-9563693-0-7. 
co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
978-0-9759693-0-4
Publications due shortly:
Complex Event Processing in Heterogeneous Environments, ISBN:
978-0-9563693-3-8
Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
one out shortly

http://talebzadehmich.wordpress.com

NOTE: The information in this email is proprietary and confidential. This
message is for the designated recipient only, if you are not the intended
recipient, you should destroy it immediately. Any information in this
message shall not be understood as given or endorsed by Peridale Technology
Ltd, its subsidiaries or their employees, unless expressly so stated. It is
the responsibility of the recipient to ensure that this email is virus free,
therefore neither Peridale Ltd, its subsidiaries nor their employees accept
any responsibility.


-----Original Message-----
From: Jörn Franke [mailto:jornfranke@gmail.com] 
Sent: 06 January 2016 07:49
To: user@hive.apache.org
Subject: Re: Indexes in Hive

If I understand you correctly this could be just another Hive storage
format.

> On 06 Jan 2016, at 07:24, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
> 
> Hi,
> 
> Thinking loudly.
> 
> Ideally we should consider a totally columnar storage offering in 
> which each column of table is stored as compressed value (I disregard 
> for now how actually ORC does this but obviously it is not exactly a
columnar storage).
> 
> So each table can be considered as a loose federation of columnar 
> storage and each column is effectively an index?
> 
> As columns are far narrower than tables, each index block will be very 
> higher density and all operations like aggregates can be done directly 
> on index rather than table.
> 
> This type of table offering will be in true nature of data warehouse 
> storage. Of course row operations (get me all rows for this table) 
> will be slower but that is the trade-off that we need to consider.
> 
> Expecting users to write their own IndexHandler may be technically 
> interesting but commercially not viable as Hive needs to be a product 
> on its own merit not a development base. Writing your own storage
attributes etc.
> requires skills that will put off people seeing Hive as an attractive 
> proposition (requiring considerable investment in skill sets in order 
> to maintain Hive).
> 
> Thus my thinking on this is to offer true columnar storage in Hive to 
> be a proper data warehouse. In addition, the development tools cab ne 
> made available for those interested in tailoring their own specific 
> Hive solutions.
> 
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
> 
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCC
> dOABUr
> V8Pw
> 
> Sybase ASE 15 Gold Medal Award 2008
> A Winning Strategy: Running the most Critical Financial Data on ASE 15 
>
http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.
> pdf
> Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 
> 15", ISBN 978-0-9563693-0-7.
> co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4
> Publications due shortly:
> Complex Event Processing in Heterogeneous Environments, ISBN:
> 978-0-9563693-3-8
> Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, 
> volume one out shortly
> 
> http://talebzadehmich.wordpress.com
> 
> NOTE: The information in this email is proprietary and confidential. 
> This message is for the designated recipient only, if you are not the 
> intended recipient, you should destroy it immediately. Any information 
> in this message shall not be understood as given or endorsed by 
> Peridale Technology Ltd, its subsidiaries or their employees, unless 
> expressly so stated. It is the responsibility of the recipient to 
> ensure that this email is virus free, therefore neither Peridale Ltd, 
> its subsidiaries nor their employees accept any responsibility.
> 
> 
> -----Original Message-----
> From: Gopal Vijayaraghavan [mailto:gopal@hortonworks.com] On Behalf Of 
> Gopal Vijayaraghavan
> Sent: 05 January 2016 23:55
> To: user@hive.apache.org
> Subject: Re: Is Hive Index officially not recommended?
> 
> 
>> So in a nutshell in Hive if "external" indexes are not used for 
>> improving query response, what value they add and can we forget them 
>> for
> now?
> 
> The builtin indexes - those that write data as smaller tables are only 
> useful in a pre-columnar world, where the indexes offer a huge 
> reduction in IO.
> 
> Part #1 of using hive indexes effectively is to write your own 
> HiveIndexHandler, with usesIndexTable=false;
> 
> And then write a IndexPredicateAnalyzer, which lets you map arbitrary 
> lookups into other range conditions.
> 
> Not coincidentally - we're adding a "ANALYZE TABLE ... CACHE METADATA"
> which consolidates the "internal" index into an external store (HBase).
> 
> Some of the index data now lives in the HBase metastore, so that the 
> inclusion/exclusion of whole partitions can be done off the 
> consolidated index.
> 
> https://issues.apache.org/jira/browse/HIVE-11676
> 
> 
> The experience from BI workloads run by customers is that in general, 
> the lookup to the right "slice" of data is more of a problem than the 
> actual aggregate.
> 
> And that for a workhorse data warehouse, this has to survive even if 
> there's a non-stop stream of updates into it.
> 
> Cheers,
> Gopal
> 


Re: Indexes in Hive

Posted by Jörn Franke <jo...@gmail.com>.
If I understand you correctly this could be just another Hive storage format.

> On 06 Jan 2016, at 07:24, Mich Talebzadeh <mi...@peridale.co.uk> wrote:
> 
> Hi,
> 
> Thinking loudly.
> 
> Ideally we should consider a totally columnar storage offering in which each
> column of table is stored as compressed value (I disregard for now how
> actually ORC does this but obviously it is not exactly a columnar storage).
> 
> So each table can be considered as a loose federation of columnar storage
> and each column is effectively an index?
> 
> As columns are far narrower than tables, each index block will be very
> higher density and all operations like aggregates can be done directly on
> index rather than table. 
> 
> This type of table offering will be in true nature of data warehouse
> storage. Of course row operations (get me all rows for this table) will be
> slower but that is the trade-off that we need to consider.
> 
> Expecting users to write their own IndexHandler may be technically
> interesting but commercially not viable as Hive needs to be a product on its
> own merit not a development base. Writing your own storage attributes etc.
> requires skills that will put off people seeing Hive as an attractive
> proposition (requiring considerable investment in skill sets in order to
> maintain Hive).
> 
> Thus my thinking on this is to offer true columnar storage in Hive to be a
> proper data warehouse. In addition, the development tools cab ne made
> available for those interested in tailoring their own specific Hive
> solutions.
> 
> 
> HTH
> 
> 
> 
> Dr Mich Talebzadeh
> 
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUr
> V8Pw
> 
> Sybase ASE 15 Gold Medal Award 2008
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.
> pdf
> Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
> ISBN 978-0-9563693-0-7. 
> co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4
> Publications due shortly:
> Complex Event Processing in Heterogeneous Environments, ISBN:
> 978-0-9563693-3-8
> Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
> one out shortly
> 
> http://talebzadehmich.wordpress.com
> 
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus free,
> therefore neither Peridale Ltd, its subsidiaries nor their employees accept
> any responsibility.
> 
> 
> -----Original Message-----
> From: Gopal Vijayaraghavan [mailto:gopal@hortonworks.com] On Behalf Of Gopal
> Vijayaraghavan
> Sent: 05 January 2016 23:55
> To: user@hive.apache.org
> Subject: Re: Is Hive Index officially not recommended?
> 
> 
>> So in a nutshell in Hive if "external" indexes are not used for 
>> improving query response, what value they add and can we forget them for
> now?
> 
> The builtin indexes - those that write data as smaller tables are only
> useful in a pre-columnar world, where the indexes offer a huge reduction in
> IO.
> 
> Part #1 of using hive indexes effectively is to write your own
> HiveIndexHandler, with usesIndexTable=false;
> 
> And then write a IndexPredicateAnalyzer, which lets you map arbitrary
> lookups into other range conditions.
> 
> Not coincidentally - we're adding a "ANALYZE TABLE ... CACHE METADATA"
> which consolidates the "internal" index into an external store (HBase).
> 
> Some of the index data now lives in the HBase metastore, so that the
> inclusion/exclusion of whole partitions can be done off the consolidated
> index. 
> 
> https://issues.apache.org/jira/browse/HIVE-11676
> 
> 
> The experience from BI workloads run by customers is that in general, the
> lookup to the right "slice" of data is more of a problem than the actual
> aggregate.
> 
> And that for a workhorse data warehouse, this has to survive even if there's
> a non-stop stream of updates into it.
> 
> Cheers,
> Gopal
> 

Re: Indexes in Hive

Posted by Jörn Franke <jo...@gmail.com>.
I am not sure how much performance one could gain in comparison to ORC or
Parquet. They work pretty well once you know how to use them. However,
there is still ways to optimize them. For instance, sorting of data is a
key factor for these formats to be efficient. Nevertheless, if you have a
lot of columns then sorting each column individually does not make sense.
Here one could explore a sorting algorithm that, for instance, identifies
certain groups of values that are often queried together and co-locates
them. Alternatively, you can create for each row a hash sum over often
queried columns and do pruning only based on this hashsum (=one column).
This can be already done in Hive. Another alternative is to create
redundant tables and each of them sorted differently. This may be
implemented in Hive automatically depending on query patterns.

I think there is sometimes also a wrong perception of what is possible with
Big Data. If you query a petabyte of data and the query processes the whole
amount then you need a lot of nodes or simply live with the fact that it
takes longer. However, in many of the cases users usually query more data
then they need. There are also many cases where they could just work with
samples from the table to define their models and later let them evaluate
over the whole set of data over night. Additionally, usually there is not
one user, but many.

Given this, column-orientation does not solve everything anyway. It is just
one little part of the big picture. For example, graph structures, such as
provided by TitanDB with interactive Gremlin queries, can be much faster
executed for certain scenarios than in a column store.
Interactive In-memory technologies, such as Apache Ignite, can speed up
Hive or even Spark if you have a lot of users or processes that share data.
I think TEZ+LLAP show also some interesting features for Hive related to
this.


In my blog you can find some discussion on how to optimize for big data
technologies in general.

On Wed, Jan 6, 2016 at 7:24 AM, Mich Talebzadeh <mi...@peridale.co.uk> wrote:

> Hi,
>
> Thinking loudly.
>
> Ideally we should consider a totally columnar storage offering in which
> each
> column of table is stored as compressed value (I disregard for now how
> actually ORC does this but obviously it is not exactly a columnar storage).
>
> So each table can be considered as a loose federation of columnar storage
> and each column is effectively an index?
>
> As columns are far narrower than tables, each index block will be very
> higher density and all operations like aggregates can be done directly on
> index rather than table.
>
> This type of table offering will be in true nature of data warehouse
> storage. Of course row operations (get me all rows for this table) will be
> slower but that is the trade-off that we need to consider.
>
> Expecting users to write their own IndexHandler may be technically
> interesting but commercially not viable as Hive needs to be a product on
> its
> own merit not a development base. Writing your own storage attributes etc.
> requires skills that will put off people seeing Hive as an attractive
> proposition (requiring considerable investment in skill sets in order to
> maintain Hive).
>
> Thus my thinking on this is to offer true columnar storage in Hive to be a
> proper data warehouse. In addition, the development tools cab ne made
> available for those interested in tailoring their own specific Hive
> solutions.
>
>
> HTH
>
>
>
> Dr Mich Talebzadeh
>
> LinkedIn
>
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUr
> V8Pw
>
> Sybase ASE 15 Gold Medal Award 2008
> A Winning Strategy: Running the most Critical Financial Data on ASE 15
>
> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.
> pdf
> Author of the books "A Practitioner's Guide to Upgrading to Sybase ASE 15",
> ISBN 978-0-9563693-0-7.
> co-author "Sybase Transact SQL Guidelines Best Practices", ISBN
> 978-0-9759693-0-4
> Publications due shortly:
> Complex Event Processing in Heterogeneous Environments, ISBN:
> 978-0-9563693-3-8
> Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume
> one out shortly
>
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this
> message shall not be understood as given or endorsed by Peridale Technology
> Ltd, its subsidiaries or their employees, unless expressly so stated. It is
> the responsibility of the recipient to ensure that this email is virus
> free,
> therefore neither Peridale Ltd, its subsidiaries nor their employees accept
> any responsibility.
>
>
> -----Original Message-----
> From: Gopal Vijayaraghavan [mailto:gopal@hortonworks.com] On Behalf Of
> Gopal
> Vijayaraghavan
> Sent: 05 January 2016 23:55
> To: user@hive.apache.org
> Subject: Re: Is Hive Index officially not recommended?
>
>
> >So in a nutshell in Hive if "external" indexes are not used for
> >improving query response, what value they add and can we forget them for
> now?
>
> The builtin indexes - those that write data as smaller tables are only
> useful in a pre-columnar world, where the indexes offer a huge reduction in
> IO.
>
> Part #1 of using hive indexes effectively is to write your own
> HiveIndexHandler, with usesIndexTable=false;
>
> And then write a IndexPredicateAnalyzer, which lets you map arbitrary
> lookups into other range conditions.
>
> Not coincidentally - we're adding a "ANALYZE TABLE ... CACHE METADATA"
> which consolidates the "internal" index into an external store (HBase).
>
> Some of the index data now lives in the HBase metastore, so that the
> inclusion/exclusion of whole partitions can be done off the consolidated
> index.
>
> https://issues.apache.org/jira/browse/HIVE-11676
>
>
> The experience from BI workloads run by customers is that in general, the
> lookup to the right "slice" of data is more of a problem than the actual
> aggregate.
>
> And that for a workhorse data warehouse, this has to survive even if
> there's
> a non-stop stream of updates into it.
>
> Cheers,
> Gopal
>
>