Posted to user@hive.apache.org by Mich Talebzadeh <mi...@gmail.com> on 2016/04/19 00:34:57 UTC

Hive footprint

Hi,

I notice that Impala is rarely mentioned these days. I may be missing
something. However, I gather it is coming to an end now, as I don't recall many
use cases for it (or customers asking for it). In contrast, Hive has held
its ground with the new addition of Spark and Tez as execution engines,
support for ACID and ORC, and new features in Hive 2. In addition, provided a
good choice is made for its metastore, it scales well.

If Hive had native (organic) support for local variables and stored
procedures, it would be a top-notch data warehouse. Given its
metastore, I don't see any technical reason why it cannot support these
constructs.

I was recently asked to comment on migration from commercial DWs to Big
Data (primarily for TCO reasons) and really could not recall any better
candidate than Hive. Is HBase a viable alternative? Obviously, whatever one
decides, there is still HDFS, a good engine for Hive (it sounds like many
prefer Tez, although I am a Spark fan) and the ubiquitous YARN.

Let me know your thoughts.


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com

Re: Hive footprint

Posted by Mich Talebzadeh <mi...@gmail.com>.

BTW what is the situation with Impala?

On 19 April 2016 at 08:31, Mich Talebzadeh <mi...@gmail.com>
wrote:

> The issue is that Hive has indexes (not an index store) but they don't work,
> so there we go. Maybe in later releases we can make use of these indexes
> for faster queries. Hive even allows bitmap indexes on a fact table, but they
> are never used by the CBO.
>
> show indexes on sales;
>
>
> +-----------------------+-----------------------+-----------------------+------------------------------------------+-----------------------+----------+--+
> |       idx_name        |       tab_name        |       col_names       |               idx_tab_name               |       idx_type        | comment  |
> +-----------------------+-----------------------+-----------------------+------------------------------------------+-----------------------+----------+--+
> | sales_cust_bix        | sales                 | cust_id               | oraclehadoop__sales_sales_cust_bix__     | bitmap                |          |
> | sales_channel_bix     | sales                 | channel_id            | oraclehadoop__sales_sales_channel_bix__  | bitmap                |          |
> | sales_prod_bix        | sales                 | prod_id               | oraclehadoop__sales_sales_prod_bix__     | bitmap                |          |
> | sales_promo_bix       | sales                 | promo_id              | oraclehadoop__sales_sales_promo_bix__    | bitmap                |          |
> | sales_time_bix        | sales                 | time_id               | oraclehadoop__sales_sales_time_bix__     | bitmap                |          |
> +-----------------------+-----------------------+-----------------------+------------------------------------------+-----------------------+----------+--+
>
> On 18 April 2016 at 23:51, Marcin Tustin <mt...@handybook.com> wrote:
>
>> We use a Hive with ORC setup now. Queries may take thousands of seconds
>> with joins, and potentially tens of seconds with selects on very large
>> tables.
>>
>> My understanding is that the goal of HBase is to provide much lower
>> latency for queries. Obviously, this comes at the cost of not being able to
>> perform joins. I don't actually use HBase, so I hesitate to say more about
>> it.
>>
>> On Mon, Apr 18, 2016 at 6:48 PM, Mich Talebzadeh <
>> mich.talebzadeh@gmail.com> wrote:
>>
>>> Thanks Marcin.
>>>
>>> What is the definition of low latency here? Are you referring to the
>>> performance of SQL against HBase tables compared to Hive? As I understand it,
>>> HBase is a columnar database. Would it be possible to use Hive against ORC
>>> to achieve the same?
>>>
>>> On 18 April 2016 at 23:43, Marcin Tustin <mt...@handybook.com> wrote:
>>>
>>>> HBase has a different use case - it's for low-latency querying of big
>>>> tables. If you combined it with Hive, you might have something nice for
>>>> certain queries, but I wouldn't think of them as direct competitors.

Re: Hive footprint

Posted by Mich Talebzadeh <mi...@gmail.com>.

This simply does not work, but we need to make Hive use external indexes.
This is a must.

On 20 April 2016 at 19:37, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Hi,
>
> If I may, I would also like to see where the Hive optimizer shows that it
> is used with explain ... or other means. It will be interesting.
>
> HTH

Re: Hive footprint

Posted by Mich Talebzadeh <mi...@gmail.com>.

Hi,

If I may, I would also like to see where the Hive optimizer shows that it
is used with explain ... or other means. It will be interesting.

HTH

On 20 April 2016 at 19:20, Marcin Tustin <mt...@handybook.com> wrote:

> Could you expand on this? This sounds like something that would be great
> to know, and probably fold into the wiki.

Re: Hive footprint

Posted by Marcin Tustin <mt...@handybook.com>.

Could you expand on this? This sounds like something that would be great to
know, and probably fold into the wiki.

On Wed, Apr 20, 2016 at 11:57 AM, Jörn Franke <jo...@gmail.com> wrote:

> Hive has working indexes. However many people overlook that a block is
> usually much larger than in a relational database and thus do not use them
> right.

-- 
Want to work at Handy? Check out our culture deck and open roles 
<http://www.handy.com/careers>
Latest news <http://www.handy.com/press> at Handy
Handy just raised $50m 
<http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/> led 
by Fidelity

Re: Hive footprint

Posted by Jörn Franke <jo...@gmail.com>.

Hive has working indexes. However, many people overlook that a block is usually much larger than in a relational database and thus do not use them correctly.

> On 19 Apr 2016, at 09:31, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> The issue is that Hive has indexes (not an index store) but they don't work, so there we go. Maybe in later releases we can make use of these indexes for faster queries. Hive even allows bitmap indexes on a fact table, but they are never used by the CBO.
> 
> show indexes on sales;
> 
> +-----------------------+-----------------------+-----------------------+------------------------------------------+-----------------------+----------+--+
> |       idx_name        |       tab_name        |       col_names       |               idx_tab_name               |       idx_type        | comment  |
> +-----------------------+-----------------------+-----------------------+------------------------------------------+-----------------------+----------+--+
> | sales_cust_bix        | sales                 | cust_id               | oraclehadoop__sales_sales_cust_bix__     | bitmap                |          |
> | sales_channel_bix     | sales                 | channel_id            | oraclehadoop__sales_sales_channel_bix__  | bitmap                |          |
> | sales_prod_bix        | sales                 | prod_id               | oraclehadoop__sales_sales_prod_bix__     | bitmap                |          |
> | sales_promo_bix       | sales                 | promo_id              | oraclehadoop__sales_sales_promo_bix__    | bitmap                |          |
> | sales_time_bix        | sales                 | time_id               | oraclehadoop__sales_sales_time_bix__     | bitmap                |          |
> +-----------------------+-----------------------+-----------------------+------------------------------------------+-----------------------+----------+--+

Re: Hive footprint

Posted by Mich Talebzadeh <mi...@gmail.com>.

The issue is that Hive has indexes (conventional ones, not the ORC storage
index) but they are not used, so there we go. Maybe in later releases we can
make use of these indexes for faster queries. Hive even allows bitmap indexes
on a fact table, but they are never used by the CBO.

show indexes on sales;

+-----------------------+-----------------------+-----------------------+------------------------------------------+-----------------------+----------+--+
|       idx_name        |       tab_name        |       col_names       |               idx_tab_name               |       idx_type        | comment  |
+-----------------------+-----------------------+-----------------------+------------------------------------------+-----------------------+----------+--+
| sales_cust_bix        | sales                 | cust_id               | oraclehadoop__sales_sales_cust_bix__     | bitmap                |          |
| sales_channel_bix     | sales                 | channel_id            | oraclehadoop__sales_sales_channel_bix__  | bitmap                |          |
| sales_prod_bix        | sales                 | prod_id               | oraclehadoop__sales_sales_prod_bix__     | bitmap                |          |
| sales_promo_bix       | sales                 | promo_id              | oraclehadoop__sales_sales_promo_bix__    | bitmap                |          |
| sales_time_bix        | sales                 | time_id               | oraclehadoop__sales_sales_time_bix__     | bitmap                |          |
+-----------------------+-----------------------+-----------------------+------------------------------------------+-----------------------+----------+--+
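
For readers less familiar with bitmap indexes such as the ones listed in
the table, the idea can be sketched in a few lines of Python. This is only
an illustration of the general technique, not of how Hive stores or uses
its index tables. The column values and the function names
build_bitmap_index and rows_from_bitmap are made up for the example.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)


def rows_from_bitmap(bitmap):
    """Return the row positions whose bits are set in the bitmap."""
    rows = []
    row = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows


# Hypothetical fact-table columns (made-up values, not from the thread).
cust_id = [10, 20, 10, 30, 20, 10]
channel_id = [1, 1, 2, 1, 2, 2]

cust_bix = build_bitmap_index(cust_id)
channel_bix = build_bitmap_index(channel_id)

# WHERE cust_id = 10 AND channel_id = 2 becomes a bitwise AND of two bitmaps.
matching = rows_from_bitmap(cust_bix[10] & channel_bix[2])
print(matching)
```

The point of the structure is that combining predicates on low-cardinality
columns is a cheap bitwise operation, which is why bitmap indexes are a
natural fit for fact tables, provided the optimiser actually consults them.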





Dr Mich Talebzadeh

On 18 April 2016 at 23:51, Marcin Tustin <mt...@handybook.com> wrote:

> We use a hive with ORC setup now. Queries may take thousands of seconds
> with joins, and potentially tens of seconds with selects on very large
> tables.
>
> My understanding is that the goal of hbase is to provide much lower
> latency for queries. Obviously, this comes at the cost of not being able to
> perform joins. I don't actually use hbase, so I hesitate to say more about
> it.

Re: Hive footprint

Posted by Mich Talebzadeh <mi...@gmail.com>.

A caveat here.

An OLTP database such as Oracle or SAP ASE will use indexes for point
queries, in other words when the search is via an index scan. In that case
the search will be very fast because typically only a few blocks are needed:
the index scan yields RowID pointers to the underlying data blocks, which
are then read from disk to get the records.

When an OLAP-type read is required there is less need for an index, as the
optimiser does a serial scan and the work is pretty efficient. As a rule
of thumb (if I am correct), if the Oracle CBO estimates that the result set
will be more than 4% of the underlying rows, it will favour a table scan.
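
The rule of thumb in the paragraph above can be written down as a small
Python sketch. It is a deliberate simplification: a real cost-based
optimiser weighs many more factors than selectivity alone, and the 4%
threshold is the poster's recollection rather than a documented constant.
The function name choose_access_path is hypothetical.

```python
def choose_access_path(estimated_rows, total_rows, threshold=0.04):
    """Pick an access path from the estimated selectivity of a query.

    threshold=0.04 mirrors the 4% rule of thumb from the thread; it is an
    illustrative assumption, not a documented optimiser setting.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    selectivity = estimated_rows / total_rows
    return "table scan" if selectivity > threshold else "index scan"


# A point-style query that touches very few rows.
print(choose_access_path(500, 1_000_000))
# An OLAP-style read that touches a tenth of the table.
print(choose_access_path(100_000, 1_000_000))
```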

The issues with Hive are twofold (excluding the storage index in ORC tables):

1) Hive does not take advantage of indexes (indexes in the conventional
sense) at the moment. Yes, you can even create bitmap indexes on fact tables
in Hive, but they are not used by the optimiser yet.

0: jdbc:hive2://rhes564:10010/default> show index on sales;
INFO  : OK
+-----------------------+-----------------------+-----------------------+------------------------------------------+-----------------------+----------+--+
|       idx_name        |       tab_name        |       col_names       |               idx_tab_name               |       idx_type        | comment  |
+-----------------------+-----------------------+-----------------------+------------------------------------------+-----------------------+----------+--+
| sales_cust_bix        | sales                 | cust_id               | oraclehadoop__sales_sales_cust_bix__     | bitmap                |          |
| sales_channel_bix     | sales                 | channel_id            | oraclehadoop__sales_sales_channel_bix__  | bitmap                |          |
| sales_prod_bix        | sales                 | prod_id               | oraclehadoop__sales_sales_prod_bix__     | bitmap                |          |
| sales_promo_bix       | sales                 | promo_id              | oraclehadoop__sales_sales_promo_bix__    | bitmap                |          |
| sales_time_bix        | sales                 | time_id               | oraclehadoop__sales_sales_time_bix__     | bitmap                |          |
+-----------------------+-----------------------+-----------------------+------------------------------------------+-----------------------+----------+--+

2) The blocks of a Hive table are not stored sequentially. The underlying
issue is that HDFS lacks the ability to co-locate blocks, so a table scan in
the sense of a conventional RDBMS does not really exist. However, I believe
there are plans to make indexes available to the CBO in Hive, in which case
they will speed up queries. Alan Gates may have more on this.

HTH,




Dr Mich Talebzadeh

On 20 April 2016 at 13:07, Sabarish Sasidharan <sa...@gmail.com>
wrote:

> HBase is very good for direct key-based lookups, and for scans over a
> range of keys (data is sorted by key).
>
> Hive, by contrast, is not good for seeks (the needle-in-a-haystack
> problem). You can optimize with ORC, stripes, sorting etc., but it is still
> a needle-in-a-haystack problem.
>
> Apache Kylin takes a different approach. It maintains the cubes in HBase
> but routes ad hoc queries to Hive. So that's one way to see them as
> complementary technologies, each solving the problems relevant to its
> space in an efficient manner.
>
> Regards
> Sab

Re: Hive footprint

Posted by Sabarish Sasidharan <sa...@gmail.com>.

HBase is very good for direct key-based lookups, and for scans over a
range of keys (data is sorted by key).
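
Why sorted keys make both access patterns cheap can be shown with a small
Python sketch. It only models the idea of a key-sorted store with a plain
sorted list; it does not use the HBase client API. The row keys and the
function names get and range_scan are hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical row keys kept in sorted order, as in a key-sorted store.
row_keys = ["cust#0001", "cust#0007", "cust#0042", "cust#0100", "cust#0250"]


def get(keys, key):
    """Point lookup: binary search for the key instead of reading every key."""
    i = bisect_left(keys, key)
    return i if i < len(keys) and keys[i] == key else None


def range_scan(keys, start, stop):
    """Return keys with start <= key <= stop; only the matching slice is touched."""
    return keys[bisect_left(keys, start):bisect_right(keys, stop)]


print(get(row_keys, "cust#0042"))
print(range_scan(row_keys, "cust#0005", "cust#0100"))
```

Both operations locate their starting position in logarithmic time, so the
cost depends on how much data matches rather than on the size of the table.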

Hive, by contrast, is not good for seeks (the needle-in-a-haystack
problem). You can optimize with ORC, stripes, sorting etc., but it is still
a needle-in-a-haystack problem.
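
The effect of stripes and sorting mentioned above can be sketched in
Python. The model below keeps a min/max value per chunk of rows and skips
chunks whose range cannot contain the target. It is a simplified stand-in
for the kind of statistics a columnar file format keeps, not the ORC
reader itself. The Stripe class, make_stripes, find and the sample data
are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    """A chunk of rows plus the min/max of one column, kept as metadata."""
    min_value: int
    max_value: int
    rows: list


def make_stripes(values, stripe_size):
    """Split values into fixed-size chunks and record each chunk's min/max."""
    stripes = []
    for i in range(0, len(values), stripe_size):
        chunk = values[i:i + stripe_size]
        stripes.append(Stripe(min(chunk), max(chunk), chunk))
    return stripes


def find(stripes, target):
    """Return (matches, stripes_read); skip stripes whose range excludes target."""
    matches, stripes_read = [], 0
    for s in stripes:
        if s.min_value <= target <= s.max_value:
            stripes_read += 1
            matches.extend(v for v in s.rows if v == target)
    return matches, stripes_read


unsorted = [5, 91, 17, 64, 3, 88, 42, 70, 9]
for label, data in (("unsorted", unsorted), ("sorted", sorted(unsorted))):
    matches, read = find(make_stripes(data, 3), 42)
    print(label, matches, "stripes read:", read)
```

With unsorted data every stripe's range covers the target, so nothing is
skipped; with sorted data only one stripe is read. Even then the reader
still scans inside that stripe, which is the sense in which the seek
remains a needle-in-a-haystack problem.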

Apache Kylin takes a different approach. It maintains the cubes in HBase
but routes ad hoc queries to Hive. So that's one way to see them as
complementary technologies, each solving the problems relevant to its
space in an efficient manner.

Regards
Sab

On Wed, Apr 20, 2016 at 11:20 AM, Amey Barve <am...@gmail.com> wrote:

> Thanks Peyman,
>
> Is running and evaluating TPC-H queries with HBaseStorageHandler vs Hive's
> Text format comparable?
> What is the standard set of queries generally used for performance
> comparison? What queries did you use above?
>
> Regards,
> Amey

Re: Hive footprint

Posted by Amey Barve <am...@gmail.com>.

Thanks Peyman,

Is running and evaluating TPC-H queries with HBaseStorageHandler vs Hive's
Text format comparable?
What is the standard set of queries generally used for performance
comparison? What queries did you use above?

Regards,
Amey



On Tue, Apr 19, 2016 at 7:28 PM, Peyman Mohajerian <mo...@gmail.com>
wrote:

> Hi Amey,
>
> It is about seek vs scan. HBase is great when a rowkey or a range of
> rowkeys is part of the WHERE clause: then you do a seek, and ORC/Parquet
> read off HDFS would not do better in the absence of an index. However, for
> a Data Warehouse that is generally not what you do; you mostly scan, e.g.
> when doing an aggregation you aren't looking for particular records. In
> this case the IO throughput dominates (generally) because you have to read
> lots of data, so reading large blocks of data and using the header info
> (predicate push-down) in ORC or Parquet will be faster than reading
> lots of HFiles in HBase. Of course compaction in HBase can turn the files
> into larger chunks, but still 'typically' it will be slower.
> I should strongly emphasize that making statements about what is faster or
> not is very dangerous; there could be many exceptions depending on the type
> of query and other factors. When I did this test I was using map/reduce, and
> with newer engines queries will be faster. Also, caching in HBase is
> critical: if all your data is cached, you have lots of memory and the system
> isn't busy handling compaction and lots of new writes, then your read
> performance in all cases will improve. Always do your own POC and use your
> own data to test.
>
> Thanks,
> Peyman

Re: Hive footprint

Posted by Peyman Mohajerian <mo...@gmail.com>.

Hi Amey,

It is about seek vs. scan. HBase is great when a rowkey or a range of
rowkeys is part of the WHERE clause: then you do a seek, and ORC/Parquet
read off HDFS would not do better in the absence of an index. However, for a
Data Warehouse that is generally not what you do; you mostly scan, e.g.
when doing an aggregation you aren't looking for a particular record or
records. In this case the I/O throughput (generally) dominates, because you
have to read lots of data, so reading large blocks of data and using header
info (predicate push-down) in ORC or Parquet will be faster than reading
lots of HFiles in HBase. Of course, compaction in HBase can merge the files
into larger chunks, but 'typically' it will still be slower.
I should strongly emphasize that making statements about what is faster or
not is very dangerous; there could be many exceptions depending on the type
of query and other factors. When I did this test I was using MapReduce, and
with newer engines queries will be faster. Also, caching in HBase is
critical: if all your data is cached, you have lots of memory, and the system
isn't busy handling compaction and lots of new writes, then your read
performance in all cases will improve. Always do your own POC and use your
own data to test.
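
To make the seek vs. scan distinction concrete, here is a minimal Python
sketch. It is only a toy model, not HBase or ORC code: the `rows` list, the
`device…` rowkeys and the helper names are all hypothetical, and a sorted
in-memory list stands in for rowkey-ordered storage.

```python
from bisect import bisect_left

# Hypothetical toy data: (rowkey, value) pairs kept sorted by rowkey,
# loosely mimicking rowkey-ordered storage.
rows = [(f"device{i:04d}", i % 10) for i in range(1000)]
keys = [k for k, _ in rows]

def seek(rowkey):
    """Point lookup: binary search inspects O(log n) keys."""
    i = bisect_left(keys, rowkey)
    if i < len(keys) and keys[i] == rowkey:
        return rows[i][1]
    return None

def seek_range(start, stop):
    """Range lookup on the key: two binary searches, then a short slice."""
    return rows[bisect_left(keys, start):bisect_left(keys, stop)]

def scan_sum():
    """Aggregation: every row is read, so raw read throughput dominates."""
    return sum(v for _, v in rows)

print(seek("device0042"))                            # 2
print(len(seek_range("device0010", "device0020")))   # 10
print(scan_sum())                                    # 4500
```

The lookups touch only a handful of keys, while the aggregation has to read
every row no matter how the data is keyed, which is why a layout built for
large sequential reads tends to win for warehouse-style queries.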

Thanks,
Peyman



On Tue, Apr 19, 2016 at 2:26 AM, Amey Barve <am...@gmail.com> wrote:

> Hi Peyman,
>
> You say: "you can use Hive storage handler to read data from HBase the
> performance would be lower than reading from HDFS directly for analytic."
> Why is that? Is it slow compared with ORC, Parquet, and even the text file
> format?
>
> Regards,
> Amey

Re: Hive footprint

Posted by Amey Barve <am...@gmail.com>.

Hi Peyman,

You say: "you can use Hive storage handler to read data from HBase the
performance would be lower than reading from HDFS directly for analytic."
Why is that? Is it slow compared with ORC, Parquet, and even the text file
format?

Regards,
Amey

On Tue, Apr 19, 2016 at 4:32 AM, Peyman Mohajerian <mo...@gmail.com>
wrote:

> HBase can handle high read/write throughput, e.g. IoT use cases. It is not
> an analytic engine: even though you can use the Hive storage handler to read
> data from HBase, the performance would be lower than reading from HDFS
> directly for analytics. But HBase has an index, the rowkey, and you can add
> a secondary index, usually with Elasticsearch or other means. You can also
> run Phoenix over HBase to do analytics, but again only if your data
> collection/use case mandates HBase, e.g. small amounts of data from millions
> of devices. It is common to copy data from HBase to HDFS (even though HBase
> is sitting on top of HDFS) as ORC/Parquet for very large analytics. But
> again, you do have the choice of using Phoenix or Hive to run analytics over
> HBase if you don't want to pay the cost of data copying.
> HBase can only be part of a DW solution in a limited way, e.g. as an index
> to data in HDFS, for partition discovery, etc. Pretty soon it will be the
> metadata store for Hive (as an option, instead of an RDBMS). HBase can sit
> on the edge of the DW to collect fast-landing data.
> I don't see any competition between Hive and HBase; they work together, and
> I don't see a modern DW having a monolithic engine, but rather
> Tez+Spark+MPP+...

Re: Hive footprint

Posted by Peyman Mohajerian <mo...@gmail.com>.

HBase can handle high read/write throughput, e.g. IoT use cases. It is not
an analytic engine: even though you can use the Hive storage handler to read
data from HBase, the performance would be lower than reading from HDFS
directly for analytics. But HBase has an index, the rowkey, and you can add
a secondary index, usually with Elasticsearch or other means. You can also
run Phoenix over HBase to do analytics, but again only if your data
collection/use case mandates HBase, e.g. small amounts of data from millions
of devices. It is common to copy data from HBase to HDFS (even though HBase
is sitting on top of HDFS) as ORC/Parquet for very large analytics. But
again, you do have the choice of using Phoenix or Hive to run analytics over
HBase if you don't want to pay the cost of data copying.
HBase can only be part of a DW solution in a limited way, e.g. as an index to
data in HDFS, for partition discovery, etc. Pretty soon it will be the
metadata store for Hive (as an option, instead of an RDBMS). HBase can sit
on the edge of the DW to collect fast-landing data.
I don't see any competition between Hive and HBase; they work together, and I
don't see a modern DW having a monolithic engine, but rather
Tez+Spark+MPP+...



On Mon, Apr 18, 2016 at 3:51 PM, Marcin Tustin <mt...@handybook.com>
wrote:

> We use a Hive-with-ORC setup now. Queries may take thousands of seconds
> with joins, and potentially tens of seconds with selects on very large
> tables.
>
> My understanding is that the goal of HBase is to provide much lower
> latency for queries. Obviously, this comes at the cost of not being able to
> perform joins. I don't actually use HBase, so I hesitate to say more about
> it.

Re: Hive footprint

Posted by Marcin Tustin <mt...@handybook.com>.

We use a Hive-with-ORC setup now. Queries may take thousands of seconds
with joins, and potentially tens of seconds with selects on very large
tables.

My understanding is that the goal of HBase is to provide much lower latency
for queries. Obviously, this comes at the cost of not being able to perform
joins. I don't actually use HBase, so I hesitate to say more about it.

On Mon, Apr 18, 2016 at 6:48 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Thanks Marcin.
>
> What is the definition of low latency here? Are you referring to the
> performance of SQL against HBase tables compared to Hive? As I understand
> it, HBase is a columnar database. Would it be possible to use Hive against
> ORC to achieve the same?

-- 
Want to work at Handy? Check out our culture deck and open roles 
<http://www.handy.com/careers>
Latest news <http://www.handy.com/press> at Handy
Handy just raised $50m 
<http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/> led 
by Fidelity

Re: Hive footprint

Posted by Mich Talebzadeh <mi...@gmail.com>.

Thanks Marcin.

What is the definition of low latency here? Are you referring to the
performance of SQL against HBase tables compared to Hive? As I understand
it, HBase is a columnar database. Would it be possible to use Hive against
ORC to achieve the same?

Dr Mich Talebzadeh




On 18 April 2016 at 23:43, Marcin Tustin <mt...@handybook.com> wrote:

> HBase has a different use case - it's for low-latency querying of big
> tables. If you combined it with Hive, you might have something nice for
> certain queries, but I wouldn't think of them as direct competitors.

Re: Hive footprint

Posted by Marcin Tustin <mt...@handybook.com>.

HBase has a different use case - it's for low-latency querying of big
tables. If you combined it with Hive, you might have something nice for
certain queries, but I wouldn't think of them as direct competitors.

On Mon, Apr 18, 2016 at 6:34 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Hi,
>
> I notice that Impala is rarely mentioned these days. I may be missing
> something. However, I gather it is coming to an end now, as I don't recall
> many use cases for it (or customers asking for it). In contrast, Hive has
> held its ground with the new addition of Spark and Tez as execution engines,
> support for ACID and ORC, and new stuff in Hive 2. In addition, provided a
> good choice is made for its metastore, it scales well.
>
> If Hive had the (organic) ability to support local variables and stored
> procedures, then it would be a top-notch Data Warehouse. Given its
> metastore, I don't see any technical reason why it cannot support these
> constructs.
>
> I was recently asked to comment on migration from commercial DWs to Big
> Data (primarily for TCO reasons) and really could not recall any better
> candidate than Hive. Is HBase a viable alternative? Obviously, whatever one
> decides, there is still HDFS, a good engine for Hive (it sounds like many
> prefer Tez, although I am a Spark fan) and the ubiquitous YARN.
>
> Let me know your thoughts.


Re: Hive footprint

Posted by Alan Gates <al...@gmail.com>.

> On Apr 18, 2016, at 15:34, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> Hi,
> <snip>
> 
> If Hive had the ability (organic) to have local variable and stored procedure support then it would be top notch Data Warehouse. Given its metastore, I don't see any technical reason why it cannot support these constructs.
> 
Are you aware of the HPL/SQL module added in Hive 2.0? If not, see <https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=59690156>. It’s not deeply integrated yet, but it’s a step in the procedural direction.

Alan.

Re: Hive footprint

Posted by Mich Talebzadeh <mi...@gmail.com>.

Hi Naveen,

Thank you for your detailed explanation.

Please allow me to explain my points, if I may.

I think a viable solution for a big data stack will encompass (again, this is
my view) Spark with Hive, HDFS and YARN as the winning combination. Hadoop
encompasses HDFS, and it is almost impossible to sidestep it without
finding a viable alternative for persistent storage. YARN is the resource
rock, Spark is a great query tool (including Spark Streaming), and Hive is
the real Data Warehouse in the Big Data space that provides the metadata for
all the tools.

You will forgive me for setting aside Impala, as I don't hear much about it
anymore (please feel free to agree to differ). So my prime interest is to
see Hive improved as it should be, i.e. as a proper Data Warehouse with a
proper indexing strategy. I don’t really subscribe to ORC storage indexes, as
in my experience they have not delivered the contribution to the Hive CBO
that was expected. My personal experience has been that they provide some
improvement on what is already available (stats-wise), but otherwise,
unless you bucket your table (i.e. have an effective numeric column with
high cardinality that can be used to hash partition the table), one
cannot make effective use of the storage index.
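
As a rough illustration of why the physical layout matters for a min/max
style storage index, here is a small Python sketch. It is a toy model under
stated assumptions, not ORC's actual implementation: the "stripes" are plain
lists, the stripe size of 1000 is arbitrary, and sorting stands in for any
layout (bucketing, sorting, clustering) that keeps similar values together.

```python
import random

def make_stripes(values, stripe_size):
    """Split values into stripes and record min/max statistics per stripe."""
    stripes = []
    for i in range(0, len(values), stripe_size):
        chunk = values[i:i + stripe_size]
        stripes.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return stripes

def stripes_read(stripes, lo, hi):
    """Count stripes whose [min, max] range overlaps lo <= x <= hi.

    Only these stripes must be read; the rest can be skipped using the
    statistics alone.
    """
    return sum(1 for s in stripes if s["max"] >= lo and s["min"] <= hi)

values = list(range(10_000))

# Layout 1: values stored in order, so each stripe covers a narrow range.
clustered = make_stripes(values, 1000)

# Layout 2: same values in random order, so each stripe spans a wide range.
random.seed(0)
shuffled = values[:]
random.shuffle(shuffled)
scattered = make_stripes(shuffled, 1000)

print(stripes_read(clustered, 2500, 2600))  # 1
print(stripes_read(scattered, 2500, 2600))  # far more stripes, typically all
```

With the ordered layout the statistics rule out all but one stripe; with the
shuffled layout nearly every stripe has to be read, so the index adds little.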

Now back to Hive and its external indexes. Currently the infrastructure is
there but not the functionality. I don’t know what it takes to have indexes
in Hive taken into account by the CBO. We should aim to consolidate the
Hadoop ecosystem by investing in the existing tools rather than trying to
fragment it further. There seems to be little effort in this area, for
reasons that I may not be aware of. However, I am more than happy to
contribute to this cause.

Kind regards,

Mich

Dr Mich Talebzadeh


On 25 April 2016 at 19:28, Naveen Gangam <ng...@cloudera.com> wrote:

> Hi Mich,
> I am a developer at Cloudera and contribute to Apache Hive.
>
> Hive and MPP query engine projects like Impala have settled into their
> respective positions so there is less confusion between these projects.
>
> For example, across Cloudera's customer base the majority of customers use
> Impala to enable them to perform BI and SQL analytics directly on Hadoop.
> Most Impala users are using Hive for the data preparation of the data sets
> they're serving up via Impala. As such Impala typically competes with
> traditional analytic databases where customers decide between:
>     * Using Hadoop and Hive for data processing that feeds into another
> database or BI layer for the analytics
>     * Unified architecture where they directly serve some sets of BI and
> analytics from Hadoop via Impala while typically using Hive, Spark,
> MapReduce, etc for their data preparation
> You can see that nearly all Hadoop distributions provide users with Hive
> for core data processing plus an MPP query engine for BI and SQL analytics,
> such as Impala, Drill, BigSQL, etc. Even Facebook, which created and still
> heavily uses Hive, also uses Presto internally as its MPP query engine for
> BI.
>
> For more details you can see Cloudera's SQL-on-Hadoop webinar that talks
> about when to use Hive, Impala, and Spark (SQL)
> <http://www.cloudera.com/resources/recordedwebinar/hive-impala-and-spark-oh-my-sql-on-Hadoop-in-cloudera-5-5.html>
>
>
> Support for local variables and stored procedures in Hive is included in
> the HPL/SQL module of Hive 2.0. However, this is an experimental feature. We
> will evaluate it for production-readiness before including it in CDH Hive.
>
> Finally, HBase is typically not the best storage manager for migrations
> from commercial DWs to Big Data. Most commercial DW migrations use HDFS
> rather than HBase as the storage manager.
>
> Hope this helps.
>
> Thank you
> Naveen

Re: Hive footprint

Posted by Naveen Gangam <ng...@cloudera.com>.

Hi Mich,
I am a developer at Cloudera and contribute to Apache Hive.

Hive and MPP query engine projects like Impala have settled into their
respective positions so there is less confusion between these projects.

For example, across Cloudera's customer base the majority of customers use
Impala to perform BI and SQL analytics directly on Hadoop. Most Impala
users use Hive to prepare the data sets they serve up via Impala. As such,
Impala typically competes with traditional analytic databases, where
customers decide between:
    * Using Hadoop and Hive for data processing that feeds into another
      database or BI layer for the analytics
    * A unified architecture where they directly serve some sets of BI and
      analytics from Hadoop via Impala while typically using Hive, Spark,
      MapReduce, etc. for their data preparation

You can see that nearly all Hadoop distributions provide users with Hive for
core data processing plus an MPP query engine for BI and SQL analytics, such
as Impala, Drill, or BigSQL. Even Facebook, which created and still heavily
uses Hive, also uses Presto internally as its MPP query engine for BI.

For more details you can see Cloudera's SQL-on-Hadoop webinar that talks
about when to use Hive, Impala, and Spark (SQL)
<http://www.cloudera.com/resources/recordedwebinar/hive-impala-and-spark-oh-my-sql-on-Hadoop-in-cloudera-5-5.html>


Support for local variables and stored procedures in Hive is included in
the HPL/SQL module of Hive 2.0. However, this is an experimental feature. We
will evaluate it for production-readiness before including it in CDH Hive.

Finally, HBase is typically not the best storage manager for migrations
from commercial DWs to Big Data. Most commercial DW migrations use HDFS
rather than HBase as the storage manager.

Hope this helps.

Thank you
Naveen

On Mon, Apr 18, 2016 at 6:34 PM, Mich Talebzadeh <mi...@gmail.com>
wrote:

> Hi,
>
> I notice that Impala is rarely mentioned these days.  I may be missing
> something. However, I gather it is coming to end now as I don't recall many
> use cases for it (or customers asking for it). In contrast, Hive has hold
> its ground with the new addition of Spark and Tez as execution engines,
> support for ACID and ORC and new stuff in Hive 2. In addition provided a
> good choice for its metastore it scales well.
>
> If Hive had the ability (organic) to have local variable and stored
> procedure support then it would be top notch Data Warehouse. Given its
> metastore, I don't see any technical reason why it cannot support these
> constructs.
>
> I was recently asked to comment on migration from commercial DWs to Big
> Data (primarily for TCO reason) and really could not recall any better
> candidate than Hive. Is HBase a viable alternative? Obviously whatever one
> decides there is still HDFS, a good engine for Hive (sounds like many
> prefer TEZ although I am a Spark fan) and the ubiquitous YARN.
>
> Let me know your thoughts.
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>

Re: Hive footprint

Posted by Jörn Franke <jo...@gmail.com>.

It really depends on what you want to do. Hive is more for queries involving a lot of data, whereas HBase+Phoenix is more for OLTP scenarios or sensor ingestion.

I think the reason is that Hive has been the entry point for many engines and formats. Additionally, there are many tuning capabilities, from hardware to software, to make it fast. Thus, other software has always had a somewhat harder time.


> On 19 Apr 2016, at 00:34, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> Hi,
> 
> I notice that Impala is rarely mentioned these days.  I may be missing something. However, I gather it is coming to end now as I don't recall many use cases for it (or customers asking for it). In contrast, Hive has hold its ground with the new addition of Spark and Tez as execution engines, support for ACID and ORC and new stuff in Hive 2. In addition provided a good choice for its metastore it scales well.
> 
> If Hive had the ability (organic) to have local variable and stored procedure support then it would be top notch Data Warehouse. Given its metastore, I don't see any technical reason why it cannot support these constructs.
> 
> I was recently asked to comment on migration from commercial DWs to Big Data (primarily for TCO reason) and really could not recall any better candidate than Hive. Is HBase a viable alternative? Obviously whatever one decides there is still HDFS, a good engine for Hive (sounds like many prefer TEZ although I am a Spark fan) and the ubiquitous YARN.
> 
> Let me know your thoughts.
> 
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>