
Posted to user@hive.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2020/11/11 16:25:20 UTC

Fwd: How useful are tools for Hive data modeling

Hi all,


I wrote these notes earlier this year.


I heard today that someone said Hive 1 does not support indexes but
Hive 2 does.


I still believe that Hive does not support indexing as per below. Has this
changed?


Regards,


Mich

---------- Forwarded message ---------
From: Mich Talebzadeh <mi...@gmail.com>
Date: Thu, 2 Apr 2020 at 12:17
Subject: How useful are tools for Hive data modeling
To: user <us...@hive.apache.org>


Hi,

Fundamentally, Hive tables have structure, and support for inspecting it is
provided by desc formatted <TABLE> and show partitions <TABLE>.

Hive does not support indexes in real HQL operations (I stand to be
corrected). So what we have are tables, partitions and clustering (AKA hash
partitioning).

Hive does not support indexes because Hadoop lacks the block locality
necessary for indexes. So if I use a tool like Collibra, Ab Initio etc., what
advantage(s) is one going to gain over a simple shell script that gets the
table and partition definitions?
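
As a rough illustration of the "simple script" idea, the sketch below (in
Python rather than shell) only builds the two statements mentioned above for
a list of tables. The table names are made up, and how the statements are
then run (for example saved to a file and passed to the Hive CLI or Beeline)
is left as an assumption.

```python
def metadata_statements(tables):
    """Return the HiveQL statements that dump table and partition definitions."""
    statements = []
    for table in tables:
        # Column, storage and location details for the table.
        statements.append(f"DESCRIBE FORMATTED {table};")
        # Partition listing; Hive rejects this for non-partitioned tables.
        statements.append(f"SHOW PARTITIONS {table};")
    return statements


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    for stmt in metadata_statements(["sales.orders", "sales.customers"]):
        print(stmt)
```

Since SHOW PARTITIONS fails on a table that is not partitioned, a real script
would have to handle or skip that case.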

Thanks,


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.

Re: How useful are tools for Hive data modeling

Posted by Panos Garefalakis <pa...@gmail.com>.

Hey Mich,


I agree with Austin's reply: a fundamental way of skipping reads of data
that the query does not need is table partitioning, so that would be the
first thing to check (along with skew).
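
To make the partitioning point concrete, here is a minimal sketch of the
idea, not of Hive's implementation: partitions are modelled as plain
key/value dictionaries, a filter keeps only the matching ones, and a simple
ratio flags skew. The partition keys, values and row counts are invented for
the example.

```python
def prune_partitions(partitions, wanted):
    """Keep only partitions whose key/value pairs match every filter in `wanted`."""
    return [p for p in partitions
            if all(p.get(key) == value for key, value in wanted.items())]


def skew_ratio(row_counts):
    """Largest partition size divided by the mean size (1.0 means no skew)."""
    mean = sum(row_counts) / len(row_counts)
    return max(row_counts) / mean


# Hypothetical partition layout, for illustration only.
PARTITIONS = [
    {"dt": "2020-11-10", "country": "UK"},
    {"dt": "2020-11-10", "country": "US"},
    {"dt": "2020-11-11", "country": "UK"},
]

if __name__ == "__main__":
    selected = prune_partitions(PARTITIONS, {"dt": "2020-11-11"})
    print(f"scan {len(selected)} of {len(PARTITIONS)} partitions")
    print(f"skew ratio: {skew_ratio([10, 10, 40])}")
```

A filter on a column that is not a partition key cannot prune anything this
way, which is why the choice of partitioning key matters.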

Columnar formats such as Parquet and ORC come with row-group statistics
(such as min/max values per column, per thousands of rows) that can
eliminate data as well and thus significantly speed up selective queries.
You just have to make sure these stats are present and are properly used
(check predicate pushdown, PPD, in Hive).
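
A toy model of how such statistics eliminate data, assuming one integer
column and a fixed group size. The RowGroup class and the function names are
made up for this sketch and do not correspond to the Parquet or ORC readers.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    min_value: int
    max_value: int
    rows: list


def build_row_groups(values, group_size):
    """Split values into consecutive groups and record min/max for each."""
    groups = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        groups.append(RowGroup(min(chunk), max(chunk), chunk))
    return groups


def scan_equals(groups, target):
    """Return (matching rows, number of row groups actually read)."""
    matches, groups_read = [], 0
    for group in groups:
        if target < group.min_value or target > group.max_value:
            continue  # the statistics prove no row here can match
        groups_read += 1
        matches.extend(v for v in group.rows if v == target)
    return matches, groups_read


if __name__ == "__main__":
    groups = build_row_groups(list(range(100)), group_size=10)
    matches, groups_read = scan_equals(groups, 42)
    print(f"matches={matches}, read {groups_read} of {len(groups)} row groups")
```

The filter has to reach the reader for any of this to happen, which is the
role of predicate pushdown.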

Finally, I am not aware of your setup / query etc. but I would argue that
Hive's performance can be pretty competitive.

Cheers,
Panagiotis


Re: How useful are tools for Hive data modeling

Posted by Austin Hackett <ha...@me.com>.

Hi Mich

Understood, I was thinking along the lines of the tool being able to auto-generate SQL join syntax etc, rather than in terms of scan performance.

I’m not so familiar with Parquet with Hive. I know that Parquet also has min and max indexes, and more recently bloom filters. However, I recall reading that Hive can’t take advantage of them. That might have changed since, though? To make the most of these, you usually need to sort your data at insert time, which may or may not be feasible.
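
A small sketch of why sorting at insert time matters for min/max indexes, using made-up data: the same 100 values are grouped ten at a time, once in a scrambled order and once sorted, and we count how many groups have a min/max range that could contain the value being looked up.

```python
def min_max_stats(values, group_size):
    """Return a (min, max) pair for each consecutive group of values."""
    stats = []
    for start in range(0, len(values), group_size):
        chunk = values[start:start + group_size]
        stats.append((min(chunk), max(chunk)))
    return stats


def groups_to_read(stats, target):
    """Count the groups whose min/max range cannot rule out the target."""
    return sum(1 for low, high in stats if low <= target <= high)


if __name__ == "__main__":
    # A deterministic scramble of 0..99 standing in for unsorted inserts.
    scrambled = [(i * 37) % 100 for i in range(100)]
    unsorted_stats = min_max_stats(scrambled, 10)
    sorted_stats = min_max_stats(sorted(scrambled), 10)
    print("unsorted:", groups_to_read(unsorted_stats, 42), "groups to read")
    print("sorted:  ", groups_to_read(sorted_stats, 42), "groups to read")
```

With the scrambled order every group spans nearly the whole value range, so none can be skipped; once sorted, only one group can hold the value.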

If a nicely selective partitioning key plus a columnar file format (which of course Parquet is) doesn’t give you the performance you need, I guess a hand-rolled "materialised view" is where I’d look next (Hive 3.x does have native MV support, but I think only with ORC).
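
One way to read "hand-rolled" here, as a sketch only: keep a pre-aggregated summary table, create it once with CREATE TABLE ... AS SELECT, and rebuild it on a schedule with INSERT OVERWRITE. The helper names, table names and query below are hypothetical.

```python
def create_summary_table_sql(target, select_sql, file_format="PARQUET"):
    """Build a CTAS statement that materialises the query result once."""
    return f"CREATE TABLE {target} STORED AS {file_format} AS\n{select_sql};"


def refresh_summary_table_sql(target, select_sql):
    """Build the statement that rebuilds the summary table from scratch."""
    return f"INSERT OVERWRITE TABLE {target}\n{select_sql};"


if __name__ == "__main__":
    # Hypothetical source table and aggregate, for illustration only.
    query = ("SELECT order_date, SUM(amount) AS total_amount\n"
             "FROM sales.orders\n"
             "GROUP BY order_date")
    print(create_summary_table_sql("sales.daily_totals", query))
    print()
    print(refresh_summary_table_sql("sales.daily_totals", query))
```

Unlike a native materialised view, nothing keeps such a table in step with its source automatically, so the refresh has to be scheduled and queries have to name the summary table explicitly.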

Thanks

Austin




Re: How useful are tools for Hive data modeling

Posted by Mich Talebzadeh <mi...@gmail.com>.

Many thanks Austin.

The challenge, I have been told, is how to query a subset of the data
efficiently while avoiding a full table scan. The tables, I believe, are
Parquet.

I know performance in Hive is not that great, so anything that could help
would be welcome.

Cheers,



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw






Re: How useful are tools for Hive data modeling

Posted by Austin Hackett <ha...@me.com>.

Hi Mich

Hive also has non-validated constraints (primary key, foreign key etc.). Whilst I’m not too familiar with the modelling tools you mention, perhaps they’re able to use these for generating SQL etc.?
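
Purely as a sketch of what "generating SQL" from such constraints could mean (how any particular modelling tool actually does it is not something I know): given a declared foreign key, the join condition follows mechanically. The ForeignKey class, table names and column names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForeignKey:
    child_table: str
    child_column: str
    parent_table: str
    parent_column: str


def join_sql(fk):
    """Build a join between child and parent from one foreign-key declaration."""
    return (
        "SELECT *\n"
        f"FROM {fk.child_table} c\n"
        f"JOIN {fk.parent_table} p\n"
        f"  ON c.{fk.child_column} = p.{fk.parent_column}"
    )


if __name__ == "__main__":
    # Hypothetical constraint, for illustration only.
    fk = ForeignKey("sales.orders", "customer_id", "sales.customers", "id")
    print(join_sql(fk))
```

Because the constraints are not validated, a join generated this way is only as trustworthy as the data that was loaded.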

ORC files have indexes (min, max, bloom filters) - not particularly relevant to the data modelling tools question, but mentioning it for completeness…
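
For anyone unfamiliar with bloom filters, a toy version is sketched below. It is a generic textbook filter, not ORC's implementation, and the sizes and hashing scheme are arbitrary choices for the example. The property that matters for skipping data is that a "no" answer is always correct, while a "yes" may be a false positive.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _positions(self, value):
        # Derive several bit positions by hashing the value with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value):
        for position in self._positions(value):
            self.bits[position] = True

    def might_contain(self, value):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[position] for position in self._positions(value))


if __name__ == "__main__":
    bloom = BloomFilter()
    for customer in ["alice", "bob"]:
        bloom.add(customer)
    print(bloom.might_contain("alice"))  # an added value is never reported absent
    print(BloomFilter().might_contain("alice"))  # an empty filter rules everything out
```

A reader can skip a block of rows whenever the filter answers "definitely absent" for the value in an equality predicate.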

Thanks

Austin



Re: How useful are tools for Hive data modeling

Posted by Mich Talebzadeh <mi...@gmail.com>.

Many thanks Peter.





Re: How useful are tools for Hive data modeling

Posted by Peter Vary <pv...@cloudera.com>.

Hi Mich,

Index support was removed from Hive:

   - https://issues.apache.org/jira/browse/HIVE-21968
   - https://issues.apache.org/jira/browse/HIVE-18715

Thanks,
Peter
