You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hbase.apache.org by Rupinder Singh <rs...@care.com> on 2013/04/30 19:48:32 UTC

Very poor read performance with composite keys in hbase

Hi,

I have an hbase cluster where I have a table with a composite key. I map this table to a Hive external table using which I insert/select data into/from this table:
CREATE EXTERNAL TABLE event(key struct<name:string,dateCreated:string,uid:string>, {more columns here})
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '~'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, other columns ")
TBLPROPERTIES ("hbase.table.name" = "event");

The table has about 10 million rows. When I do a select * using all 3 components of the key, essentially selecting just 1 row, the response time is almost 700 sec, which seems pretty bad.

For comparison purpose, I created another table with a simple string key, and the rest of the columns etc same. The key is a string UUID. Table has same number of column families and same number of rows.
CREATE EXTERNAL TABLE test_event(key string, blah blah.....
TBLPROPERTIES ("hbase.table.name" = "test_event");

When I select a single row from this table by doing select * where key='something', the response time is 35 sec.

This seems to indicate that in case of composite keys, there is a full table scan happening.  This seems weird.

What am I missing here? Is there something special I need to do to get good read performance if I am using composite keys ?
Insert performance in both cases is comparable and is as per expectation.

Any help is appreciated.
Here is the env spec:

Amazon EMR
Hbase Cluster- 3 core nodes with 7.5 GB RAM each, 2 CPUs of 2.2 GHz each. Master 7.5 GB RAM, 2 CPUs of 2.2 GHz each
Hive Cluster - 3 core nodes 3.75 GB RAM each, 1 CPU of 1.8 GHz. Master 3.75 GB RAM, 1 CPU of 1.8 GHz

Thanks
Rupinder



This email is intended for the person(s) to whom it is addressed and may contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized use, distribution, copying, or disclosure by any person other than the addressee(s) is strictly prohibited. If you have received this email in error, please notify the sender immediately by return email and delete the message and any attachments from your system.

RE: Very poor read performance with composite keys in hbase

Posted by Rupinder Singh <rs...@care.com>.
Here it is:

select * from event where key.name='Signup' and key.dateCreated='2013-03-06 16:39:55.353' and key.uid='7af4c330-5988-4255-9250-924ce5864e3bf';


From: kulkarni.swarnim@gmail.com [mailto:kulkarni.swarnim@gmail.com]
Sent: Tuesday, April 30, 2013 11:25 PM
To: user@hive.apache.org
Cc: user@hbase.apache.org
Subject: Re: Very poor read performance with composite keys in hbase

Can you show your query that is taking 700 seconds?

On Tue, Apr 30, 2013 at 12:48 PM, Rupinder Singh <rs...@care.com>> wrote:
Hi,

I have an hbase cluster where I have a table with a composite key. I map this table to a Hive external table using which I insert/select data into/from this table:
CREATE EXTERNAL TABLE event(key struct<name:string,dateCreated:string,uid:string>, {more columns here})
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '~'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, other columns ")
TBLPROPERTIES ("hbase.table.name<http://hbase.table.name>" = "event");

The table has about 10 million rows. When I do a select * using all 3 components of the key, essentially selecting just 1 row, the response time is almost 700 sec, which seems pretty bad.

For comparison purpose, I created another table with a simple string key, and the rest of the columns etc same. The key is a string UUID. Table has same number of column families and same number of rows.
CREATE EXTERNAL TABLE test_event(key string, blah blah.....
TBLPROPERTIES ("hbase.table.name<http://hbase.table.name>" = "test_event");

When I select a single row from this table by doing select * where key='something', the response time is 35 sec.

This seems to indicate that in case of composite keys, there is a full table scan happening.  This seems weird.

What am I missing here? Is there something special I need to do to get good read performance if I am using composite keys ?
Insert performance in both cases is comparable and is as per expectation.

Any help is appreciated.
Here is the env spec:

Amazon EMR
Hbase Cluster- 3 core nodes with 7.5 GB RAM each, 2 CPUs of 2.2 GHz each. Master 7.5 GB RAM, 2 CPUs of 2.2 GHz each
Hive Cluster - 3 core nodes 3.75 GB RAM each, 1 CPU of 1.8 GHz. Master 3.75 GB RAM, 1 CPU of 1.8 GHz

Thanks
Rupinder



This email is intended for the person(s) to whom it is addressed and may contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized use, distribution, copying, or disclosure by any person other than the addressee(s) is strictly prohibited. If you have received this email in error, please notify the sender immediately by return email and delete the message and any attachments from your system.




--
Swarnim

RE: Very poor read performance with composite keys in hbase

Posted by Rupinder Singh <rs...@care.com>.
Here it is:

select * from event where key.name='Signup' and key.dateCreated='2013-03-06 16:39:55.353' and key.uid='7af4c330-5988-4255-9250-924ce5864e3bf';


From: kulkarni.swarnim@gmail.com [mailto:kulkarni.swarnim@gmail.com]
Sent: Tuesday, April 30, 2013 11:25 PM
To: user@hive.apache.org
Cc: user@hbase.apache.org
Subject: Re: Very poor read performance with composite keys in hbase

Can you show your query that is taking 700 seconds?

On Tue, Apr 30, 2013 at 12:48 PM, Rupinder Singh <rs...@care.com>> wrote:
Hi,

I have an hbase cluster where I have a table with a composite key. I map this table to a Hive external table using which I insert/select data into/from this table:
CREATE EXTERNAL TABLE event(key struct<name:string,dateCreated:string,uid:string>, {more columns here})
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '~'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, other columns ")
TBLPROPERTIES ("hbase.table.name<http://hbase.table.name>" = "event");

The table has about 10 million rows. When I do a select * using all 3 components of the key, essentially selecting just 1 row, the response time is almost 700 sec, which seems pretty bad.

For comparison purpose, I created another table with a simple string key, and the rest of the columns etc same. The key is a string UUID. Table has same number of column families and same number of rows.
CREATE EXTERNAL TABLE test_event(key string, blah blah.....
TBLPROPERTIES ("hbase.table.name<http://hbase.table.name>" = "test_event");

When I select a single row from this table by doing select * where key='something', the response time is 35 sec.

This seems to indicate that in case of composite keys, there is a full table scan happening.  This seems weird.

What am I missing here? Is there something special I need to do to get good read performance if I am using composite keys ?
Insert performance in both cases is comparable and is as per expectation.

Any help is appreciated.
Here is the env spec:

Amazon EMR
Hbase Cluster- 3 core nodes with 7.5 GB RAM each, 2 CPUs of 2.2 GHz each. Master 7.5 GB RAM, 2 CPUs of 2.2 GHz each
Hive Cluster - 3 core nodes 3.75 GB RAM each, 1 CPU of 1.8 GHz. Master 3.75 GB RAM, 1 CPU of 1.8 GHz

Thanks
Rupinder



This email is intended for the person(s) to whom it is addressed and may contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized use, distribution, copying, or disclosure by any person other than the addressee(s) is strictly prohibited. If you have received this email in error, please notify the sender immediately by return email and delete the message and any attachments from your system.




--
Swarnim

Re: Very poor read performance with composite keys in hbase

Posted by "kulkarni.swarnim@gmail.com" <ku...@gmail.com>.
Can you show your query that is taking 700 seconds?


On Tue, Apr 30, 2013 at 12:48 PM, Rupinder Singh <rs...@care.com> wrote:

>  Hi,****
>
> ** **
>
> I have an hbase cluster where I have a table with a composite key. I map
> this table to a Hive external table using which I insert/select data
> into/from this table:****
>
> CREATE EXTERNAL TABLE event(key
> struct<name:string,dateCreated:string,uid:string>, {more columns here})***
> *
>
> ROW FORMAT DELIMITED****
>
> COLLECTION ITEMS TERMINATED BY '~'****
>
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'****
>
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, other columns ")***
> *
>
> TBLPROPERTIES ("hbase.table.name" = "event");****
>
> ** **
>
> The table has about 10 million rows. When I do a select * using all 3
> components of the key, essentially selecting just 1 row, the response time
> is almost 700 sec, which seems pretty bad.****
>
> ** **
>
> For comparison purpose, I created another table with a simple string key,
> and the rest of the columns etc same. The key is a string UUID. Table has
> same number of column families and same number of rows.****
>
> CREATE EXTERNAL TABLE test_event(key string, blah blah…..****
>
> TBLPROPERTIES ("hbase.table.name" = "test_event");****
>
> ** **
>
> When I select a single row from this table by doing select * where
> key=’something’, the response time is 35 sec.****
>
> ** **
>
> This seems to indicate that in case of composite keys, there is a full
> table scan happening.  This seems weird.****
>
> ** **
>
> What am I missing here? Is there something special I need to do to get
> good read performance if I am using composite keys ?****
>
> Insert performance in both cases is comparable and is as per expectation.*
> ***
>
> ** **
>
> Any help is appreciated.****
>
> Here is the env spec:****
>
> ** **
>
> Amazon EMR****
>
> Hbase Cluster- 3 core nodes with 7.5 GB RAM each, 2 CPUs of 2.2 GHz each.
> Master 7.5 GB RAM, 2 CPUs of 2.2 GHz each****
>
> Hive Cluster – 3 core nodes 3.75 GB RAM each, 1 CPU of 1.8 GHz. Master
> 3.75 GB RAM, 1 CPU of 1.8 GHz****
>
> ** **
>
> Thanks****
>
> Rupinder****
>
>
> This email is intended for the person(s) to whom it is addressed and may
> contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized
> use, distribution, copying, or disclosure by any person other than the
> addressee(s) is strictly prohibited. If you have received this email in
> error, please notify the sender immediately by return email and delete the
> message and any attachments from your system.
>
>


-- 
Swarnim

Re: Very poor read performance with composite keys in hbase

Posted by James Taylor <jt...@salesforce.com>.
Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix)? It'll use all of the parts of your row key and depending on how much data you're returning back to the client, will query over 10 million row in seconds.

James
@JamesPlusPlus
http://phoenix-hbase.blogspot.com

On Apr 30, 2013, at 1:59 PM, "kulkarni.swarnim@gmail.com" <ku...@gmail.com> wrote:

> That depends on how dynamic your data is. If it is pretty static, you can
> also consider using something like Create Table As Select (CTAS) to create
> a snapshot of your data to HDFS and then run queries on top of that data.
> 
> So your query might become something like:
> 
> create table my_table as select * from event where key.name=’Signup’ and
> key.dateCreated=’2013-03-06 16:39:55.353’ and key.uid=’7af4c330-5988-4255-
> 9250-924ce5864e3bf’;
> 
> Since your data is now in HDFS, this should give you a considerable
> performance boost.
> 
> 
> On Tue, Apr 30, 2013 at 3:00 PM, Rupinder Singh <rs...@care.com> wrote:
> 
>> Swarnim,****
>> 
>> ** **
>> 
>> Thanks. So this means custom map reduce is the viable option when working
>> with hbase tables having composite keys, since it allows to set the start
>> and stop keys. Hive+Hbase combination is out.****
>> 
>> ** **
>> 
>> Regards****
>> 
>> Rupinder****
>> 
>> ** **
>> 
>> *From:* kulkarni.swarnim@gmail.com [mailto:kulkarni.swarnim@gmail.com]
>> *Sent:* Wednesday, May 01, 2013 12:17 AM
>> 
>> *To:* user@hive.apache.org
>> *Cc:* user@hbase.apache.org
>> *Subject:* Re: Very poor read performance with composite keys in hbase****
>> 
>> ** **
>> 
>> Rupinder,****
>> 
>> ** **
>> 
>> Hive supports a filter pushdown[1] which means that the predicates in the
>> where clause are pushed down to the storage handler level where either they
>> get handled by the storage handler or delegated to hive if they cannot
>> handle them. As of now, the HBaseStorageHandler only supports primitive
>> types. So when you use strings as keys, behind the scenes they get
>> converted to start and stop keys and restrict the hbase scan. This does not
>> happen for structs. Hence you see a full table scan causing bad performance.
>> ****
>> 
>> ** **
>> 
>> [1] https://cwiki.apache.org/Hive/filterpushdowndev.html****
>> 
>> ** **
>> 
>> On Tue, Apr 30, 2013 at 1:04 PM, Sanjay Subramanian <
>> Sanjay.Subramanian@wizecommerce.com> wrote:****
>> 
>> My experience with hive + hbase has been about 8x slower on an average. So
>> I went ahead with hive only option.
>> 
>> Sent from my iPhone****
>> 
>> 
>> On Apr 30, 2013, at 11:19 PM, "Rupinder Singh" <rs...@care.com> wrote:***
>> *
>> 
>> Hi,****
>> 
>> ****
>> 
>> I have an hbase cluster where I have a table with a composite key. I map
>> this table to a Hive external table using which I insert/select data
>> into/from this table:****
>> 
>> CREATE EXTERNAL TABLE event(key
>> struct<name:string,dateCreated:string,uid:string>, {more columns here})***
>> *
>> 
>> ROW FORMAT DELIMITED****
>> 
>> COLLECTION ITEMS TERMINATED BY '~'****
>> 
>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'****
>> 
>> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, other columns ")***
>> *
>> 
>> TBLPROPERTIES ("hbase.table.name" = "event");****
>> 
>> ****
>> 
>> The table has about 10 million rows. When I do a select * using all 3
>> components of the key, essentially selecting just 1 row, the response time
>> is almost 700 sec, which seems pretty bad.****
>> 
>> ****
>> 
>> For comparison purpose, I created another table with a simple string key,
>> and the rest of the columns etc same. The key is a string UUID. Table has
>> same number of column families and same number of rows.****
>> 
>> CREATE EXTERNAL TABLE test_event(key string, blah blah…..****
>> 
>> TBLPROPERTIES ("hbase.table.name" = "test_event");****
>> 
>> ****
>> 
>> When I select a single row from this table by doing select * where
>> key=’something’, the response time is 35 sec.****
>> 
>> ****
>> 
>> This seems to indicate that in case of composite keys, there is a full
>> table scan happening.  This seems weird.****
>> 
>> ****
>> 
>> What am I missing here? Is there something special I need to do to get
>> good read performance if I am using composite keys ?****
>> 
>> Insert performance in both cases is comparable and is as per expectation.*
>> ***
>> 
>> ****
>> 
>> Any help is appreciated.****
>> 
>> Here is the env spec:****
>> 
>> ****
>> 
>> Amazon EMR****
>> 
>> Hbase Cluster- 3 core nodes with 7.5 GB RAM each, 2 CPUs of 2.2 GHz each.
>> Master 7.5 GB RAM, 2 CPUs of 2.2 GHz each****
>> 
>> Hive Cluster – 3 core nodes 3.75 GB RAM each, 1 CPU of 1.8 GHz. Master
>> 3.75 GB RAM, 1 CPU of 1.8 GHz****
>> 
>> ****
>> 
>> Thanks****
>> 
>> Rupinder****
>> 
>> ** **
>> 
>> ** **
>> 
>> This email is intended for the person(s) to whom it is addressed and may
>> contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized
>> use, distribution, copying, or disclosure by any person other than the
>> addressee(s) is strictly prohibited. If you have received this email in
>> error, please notify the sender immediately by return email and delete the
>> message and any attachments from your system.****
>> 
>> ** **
>> 
>> ** **
>> 
>> CONFIDENTIALITY NOTICE
>> ======================
>> This email message and any attachments are for the exclusive use of the
>> intended recipient(s) and may contain confidential and privileged
>> information. Any unauthorized review, use, disclosure or distribution is
>> prohibited. If you are not the intended recipient, please contact the
>> sender by reply email and destroy all copies of the original message along
>> with any attachments, from your computer system. If you are the intended
>> recipient, please be advised that the content of this message is subject to
>> access, review and disclosure by the sender's Email System Administrator.*
>> ***
>> 
>> 
>> 
>> ****
>> 
>> ** **
>> 
>> --
>> Swarnim ****
>> 
> 
> 
> 
> -- 
> Swarnim

Re: Very poor read performance with composite keys in hbase

Posted by Anoop John <an...@gmail.com>.
Navis
        Thanks for the issue link. Currently the read queries will start MR
jobs as usual for reading from HBase. Correct?  Is there any plan for
supporting noMR?

-Anoop-


On Thu, May 2, 2013 at 7:09 AM, Navis류승우 <na...@nexr.com> wrote:

> Currently, hive storage handler reads rows one by one.
>
> https://issues.apache.org/jira/browse/HIVE-3603 is for setting cache
> size, which is not yet fixed.
>
> 2013/5/1 kulkarni.swarnim@gmail.com <ku...@gmail.com>:
> > That depends on how dynamic your data is. If it is pretty static, you can
> > also consider using something like Create Table As Select (CTAS) to
> create a
> > snapshot of your data to HDFS and then run queries on top of that data.
> >
> > So your query might become something like:
> >
> > create table my_table as select * from event where key.name=’Signup’ and
> > key.dateCreated=’2013-03-06 16:39:55.353’ and
> > key.uid=’7af4c330-5988-4255-9250-924ce5864e3bf’;
> >
> > Since your data is now in HDFS, this should give you a considerable
> > performance boost.
> >
> >
> > On Tue, Apr 30, 2013 at 3:00 PM, Rupinder Singh <rs...@care.com> wrote:
> >>
> >> Swarnim,
> >>
> >>
> >>
> >> Thanks. So this means custom map reduce is the viable option when
> working
> >> with hbase tables having composite keys, since it allows to set the
> start
> >> and stop keys. Hive+Hbase combination is out.
> >>
> >>
> >>
> >> Regards
> >>
> >> Rupinder
> >>
> >>
> >>
> >> From: kulkarni.swarnim@gmail.com [mailto:kulkarni.swarnim@gmail.com]
> >> Sent: Wednesday, May 01, 2013 12:17 AM
> >>
> >>
> >> To: user@hive.apache.org
> >> Cc: user@hbase.apache.org
> >> Subject: Re: Very poor read performance with composite keys in hbase
> >>
> >>
> >>
> >> Rupinder,
> >>
> >>
> >>
> >> Hive supports a filter pushdown[1] which means that the predicates in
> the
> >> where clause are pushed down to the storage handler level where either
> they
> >> get handled by the storage handler or delegated to hive if they cannot
> >> handle them. As of now, the HBaseStorageHandler only supports primitive
> >> types. So when you use strings as keys, behind the scenes they get
> converted
> >> to start and stop keys and restrict the hbase scan. This does not
> happen for
> >> structs. Hence you see a full table scan causing bad performance.
> >>
> >>
> >>
> >> [1] https://cwiki.apache.org/Hive/filterpushdowndev.html
> >>
> >>
> >>
> >> On Tue, Apr 30, 2013 at 1:04 PM, Sanjay Subramanian
> >> <Sa...@wizecommerce.com> wrote:
> >>
> >> My experience with hive + hbase has been about 8x slower on an average.
> So
> >> I went ahead with hive only option.
> >>
> >> Sent from my iPhone
> >>
> >>
> >> On Apr 30, 2013, at 11:19 PM, "Rupinder Singh" <rs...@care.com> wrote:
> >>
> >> Hi,
> >>
> >>
> >>
> >> I have an hbase cluster where I have a table with a composite key. I map
> >> this table to a Hive external table using which I insert/select data
> >> into/from this table:
> >>
> >> CREATE EXTERNAL TABLE event(key
> >> struct<name:string,dateCreated:string,uid:string>, {more columns here})
> >>
> >> ROW FORMAT DELIMITED
> >>
> >> COLLECTION ITEMS TERMINATED BY '~'
> >>
> >> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
> >>
> >> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, other columns ")
> >>
> >> TBLPROPERTIES ("hbase.table.name" = "event");
> >>
> >>
> >>
> >> The table has about 10 million rows. When I do a select * using all 3
> >> components of the key, essentially selecting just 1 row, the response
> time
> >> is almost 700 sec, which seems pretty bad.
> >>
> >>
> >>
> >> For comparison purpose, I created another table with a simple string
> key,
> >> and the rest of the columns etc same. The key is a string UUID. Table
> has
> >> same number of column families and same number of rows.
> >>
> >> CREATE EXTERNAL TABLE test_event(key string, blah blah…..
> >>
> >> TBLPROPERTIES ("hbase.table.name" = "test_event");
> >>
> >>
> >>
> >> When I select a single row from this table by doing select * where
> >> key=’something’, the response time is 35 sec.
> >>
> >>
> >>
> >> This seems to indicate that in case of composite keys, there is a full
> >> table scan happening.  This seems weird.
> >>
> >>
> >>
> >> What am I missing here? Is there something special I need to do to get
> >> good read performance if I am using composite keys ?
> >>
> >> Insert performance in both cases is comparable and is as per
> expectation.
> >>
> >>
> >>
> >> Any help is appreciated.
> >>
> >> Here is the env spec:
> >>
> >>
> >>
> >> Amazon EMR
> >>
> >> Hbase Cluster- 3 core nodes with 7.5 GB RAM each, 2 CPUs of 2.2 GHz
> each.
> >> Master 7.5 GB RAM, 2 CPUs of 2.2 GHz each
> >>
> >> Hive Cluster – 3 core nodes 3.75 GB RAM each, 1 CPU of 1.8 GHz. Master
> >> 3.75 GB RAM, 1 CPU of 1.8 GHz
> >>
> >>
> >>
> >> Thanks
> >>
> >> Rupinder
> >>
> >>
> >>
> >>
> >>
> >> This email is intended for the person(s) to whom it is addressed and may
> >> contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized
> >> use, distribution, copying, or disclosure by any person other than the
> >> addressee(s) is strictly prohibited. If you have received this email in
> >> error, please notify the sender immediately by return email and delete
> the
> >> message and any attachments from your system.
> >>
> >>
> >>
> >>
> >>
> >> CONFIDENTIALITY NOTICE
> >> ======================
> >> This email message and any attachments are for the exclusive use of the
> >> intended recipient(s) and may contain confidential and privileged
> >> information. Any unauthorized review, use, disclosure or distribution is
> >> prohibited. If you are not the intended recipient, please contact the
> sender
> >> by reply email and destroy all copies of the original message along
> with any
> >> attachments, from your computer system. If you are the intended
> recipient,
> >> please be advised that the content of this message is subject to access,
> >> review and disclosure by the sender's Email System Administrator.
> >>
> >>
> >>
> >>
> >>
> >> --
> >> Swarnim
> >
> >
> >
> >
> > --
> > Swarnim
>

Re: Very poor read performance with composite keys in hbase

Posted by Navis류승우 <na...@nexr.com>.
Currently, hive storage handler reads rows one by one.

https://issues.apache.org/jira/browse/HIVE-3603 is for setting cache
size, which is not yet fixed.

2013/5/1 kulkarni.swarnim@gmail.com <ku...@gmail.com>:
> That depends on how dynamic your data is. If it is pretty static, you can
> also consider using something like Create Table As Select (CTAS) to create a
> snapshot of your data to HDFS and then run queries on top of that data.
>
> So your query might become something like:
>
> create table my_table as select * from event where key.name=’Signup’ and
> key.dateCreated=’2013-03-06 16:39:55.353’ and
> key.uid=’7af4c330-5988-4255-9250-924ce5864e3bf’;
>
> Since your data is now in HDFS, this should give you a considerable
> performance boost.
>
>
> On Tue, Apr 30, 2013 at 3:00 PM, Rupinder Singh <rs...@care.com> wrote:
>>
>> Swarnim,
>>
>>
>>
>> Thanks. So this means custom map reduce is the viable option when working
>> with hbase tables having composite keys, since it allows to set the start
>> and stop keys. Hive+Hbase combination is out.
>>
>>
>>
>> Regards
>>
>> Rupinder
>>
>>
>>
>> From: kulkarni.swarnim@gmail.com [mailto:kulkarni.swarnim@gmail.com]
>> Sent: Wednesday, May 01, 2013 12:17 AM
>>
>>
>> To: user@hive.apache.org
>> Cc: user@hbase.apache.org
>> Subject: Re: Very poor read performance with composite keys in hbase
>>
>>
>>
>> Rupinder,
>>
>>
>>
>> Hive supports a filter pushdown[1] which means that the predicates in the
>> where clause are pushed down to the storage handler level where either they
>> get handled by the storage handler or delegated to hive if they cannot
>> handle them. As of now, the HBaseStorageHandler only supports primitive
>> types. So when you use strings as keys, behind the scenes they get converted
>> to start and stop keys and restrict the hbase scan. This does not happen for
>> structs. Hence you see a full table scan causing bad performance.
>>
>>
>>
>> [1] https://cwiki.apache.org/Hive/filterpushdowndev.html
>>
>>
>>
>> On Tue, Apr 30, 2013 at 1:04 PM, Sanjay Subramanian
>> <Sa...@wizecommerce.com> wrote:
>>
>> My experience with hive + hbase has been about 8x slower on an average. So
>> I went ahead with hive only option.
>>
>> Sent from my iPhone
>>
>>
>> On Apr 30, 2013, at 11:19 PM, "Rupinder Singh" <rs...@care.com> wrote:
>>
>> Hi,
>>
>>
>>
>> I have an hbase cluster where I have a table with a composite key. I map
>> this table to a Hive external table using which I insert/select data
>> into/from this table:
>>
>> CREATE EXTERNAL TABLE event(key
>> struct<name:string,dateCreated:string,uid:string>, {more columns here})
>>
>> ROW FORMAT DELIMITED
>>
>> COLLECTION ITEMS TERMINATED BY '~'
>>
>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>>
>> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, other columns ")
>>
>> TBLPROPERTIES ("hbase.table.name" = "event");
>>
>>
>>
>> The table has about 10 million rows. When I do a select * using all 3
>> components of the key, essentially selecting just 1 row, the response time
>> is almost 700 sec, which seems pretty bad.
>>
>>
>>
>> For comparison purpose, I created another table with a simple string key,
>> and the rest of the columns etc same. The key is a string UUID. Table has
>> same number of column families and same number of rows.
>>
>> CREATE EXTERNAL TABLE test_event(key string, blah blah…..
>>
>> TBLPROPERTIES ("hbase.table.name" = "test_event");
>>
>>
>>
>> When I select a single row from this table by doing select * where
>> key=’something’, the response time is 35 sec.
>>
>>
>>
>> This seems to indicate that in case of composite keys, there is a full
>> table scan happening.  This seems weird.
>>
>>
>>
>> What am I missing here? Is there something special I need to do to get
>> good read performance if I am using composite keys ?
>>
>> Insert performance in both cases is comparable and is as per expectation.
>>
>>
>>
>> Any help is appreciated.
>>
>> Here is the env spec:
>>
>>
>>
>> Amazon EMR
>>
>> Hbase Cluster- 3 core nodes with 7.5 GB RAM each, 2 CPUs of 2.2 GHz each.
>> Master 7.5 GB RAM, 2 CPUs of 2.2 GHz each
>>
>> Hive Cluster – 3 core nodes 3.75 GB RAM each, 1 CPU of 1.8 GHz. Master
>> 3.75 GB RAM, 1 CPU of 1.8 GHz
>>
>>
>>
>> Thanks
>>
>> Rupinder
>>
>>
>>
>>
>>
>> This email is intended for the person(s) to whom it is addressed and may
>> contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized
>> use, distribution, copying, or disclosure by any person other than the
>> addressee(s) is strictly prohibited. If you have received this email in
>> error, please notify the sender immediately by return email and delete the
>> message and any attachments from your system.
>>
>>
>>
>>
>>
>> CONFIDENTIALITY NOTICE
>> ======================
>> This email message and any attachments are for the exclusive use of the
>> intended recipient(s) and may contain confidential and privileged
>> information. Any unauthorized review, use, disclosure or distribution is
>> prohibited. If you are not the intended recipient, please contact the sender
>> by reply email and destroy all copies of the original message along with any
>> attachments, from your computer system. If you are the intended recipient,
>> please be advised that the content of this message is subject to access,
>> review and disclosure by the sender's Email System Administrator.
>>
>>
>>
>>
>>
>> --
>> Swarnim
>
>
>
>
> --
> Swarnim

Re: Very poor read performance with composite keys in hbase

Posted by Navis류승우 <na...@nexr.com>.
Currently, hive storage handler reads rows one by one.

https://issues.apache.org/jira/browse/HIVE-3603 is for setting cache
size, which is not yet fixed.

2013/5/1 kulkarni.swarnim@gmail.com <ku...@gmail.com>:
> That depends on how dynamic your data is. If it is pretty static, you can
> also consider using something like Create Table As Select (CTAS) to create a
> snapshot of your data to HDFS and then run queries on top of that data.
>
> So your query might become something like:
>
> create table my_table as select * from event where key.name=’Signup’ and
> key.dateCreated=’2013-03-06 16:39:55.353’ and
> key.uid=’7af4c330-5988-4255-9250-924ce5864e3bf’;
>
> Since your data is now in HDFS, this should give you a considerable
> performance boost.
>
>
> On Tue, Apr 30, 2013 at 3:00 PM, Rupinder Singh <rs...@care.com> wrote:
>>
>> Swarnim,
>>
>>
>>
>> Thanks. So this means custom map reduce is the viable option when working
>> with hbase tables having composite keys, since it allows to set the start
>> and stop keys. Hive+Hbase combination is out.
>>
>>
>>
>> Regards
>>
>> Rupinder
>>
>>
>>
>> From: kulkarni.swarnim@gmail.com [mailto:kulkarni.swarnim@gmail.com]
>> Sent: Wednesday, May 01, 2013 12:17 AM
>>
>>
>> To: user@hive.apache.org
>> Cc: user@hbase.apache.org
>> Subject: Re: Very poor read performance with composite keys in hbase
>>
>>
>>
>> Rupinder,
>>
>>
>>
>> Hive supports a filter pushdown[1] which means that the predicates in the
>> where clause are pushed down to the storage handler level where either they
>> get handled by the storage handler or delegated to hive if they cannot
>> handle them. As of now, the HBaseStorageHandler only supports primitive
>> types. So when you use strings as keys, behind the scenes they get converted
>> to start and stop keys and restrict the hbase scan. This does not happen for
>> structs. Hence you see a full table scan causing bad performance.
>>
>>
>>
>> [1] https://cwiki.apache.org/Hive/filterpushdowndev.html
>>
>>
>>
>> On Tue, Apr 30, 2013 at 1:04 PM, Sanjay Subramanian
>> <Sa...@wizecommerce.com> wrote:
>>
>> My experience with hive + hbase has been about 8x slower on an average. So
>> I went ahead with hive only option.
>>
>> Sent from my iPhone
>>
>>
>> On Apr 30, 2013, at 11:19 PM, "Rupinder Singh" <rs...@care.com> wrote:
>>
>> Hi,
>>
>>
>>
>> I have an hbase cluster where I have a table with a composite key. I map
>> this table to a Hive external table using which I insert/select data
>> into/from this table:
>>
>> CREATE EXTERNAL TABLE event(key
>> struct<name:string,dateCreated:string,uid:string>, {more columns here})
>>
>> ROW FORMAT DELIMITED
>>
>> COLLECTION ITEMS TERMINATED BY '~'
>>
>> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>>
>> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, other columns ")
>>
>> TBLPROPERTIES ("hbase.table.name" = "event");
>>
>>
>>
>> The table has about 10 million rows. When I do a select * using all 3
>> components of the key, essentially selecting just 1 row, the response time
>> is almost 700 sec, which seems pretty bad.
>>
>>
>>
>> For comparison purpose, I created another table with a simple string key,
>> and the rest of the columns etc same. The key is a string UUID. Table has
>> same number of column families and same number of rows.
>>
>> CREATE EXTERNAL TABLE test_event(key string, blah blah…..
>>
>> TBLPROPERTIES ("hbase.table.name" = "test_event");
>>
>>
>>
>> When I select a single row from this table by doing select * where
>> key=’something’, the response time is 35 sec.
>>
>>
>>
>> This seems to indicate that in case of composite keys, there is a full
>> table scan happening.  This seems weird.
>>
>>
>>
>> What am I missing here? Is there something special I need to do to get
>> good read performance if I am using composite keys ?
>>
>> Insert performance in both cases is comparable and is as per expectation.
>>
>>
>>
>> Any help is appreciated.
>>
>> Here is the env spec:
>>
>>
>>
>> Amazon EMR
>>
>> Hbase Cluster- 3 core nodes with 7.5 GB RAM each, 2 CPUs of 2.2 GHz each.
>> Master 7.5 GB RAM, 2 CPUs of 2.2 GHz each
>>
>> Hive Cluster – 3 core nodes 3.75 GB RAM each, 1 CPU of 1.8 GHz. Master
>> 3.75 GB RAM, 1 CPU of 1.8 GHz
>>
>>
>>
>> Thanks
>>
>> Rupinder
>>
>>
>>
>>
>>
>> This email is intended for the person(s) to whom it is addressed and may
>> contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized
>> use, distribution, copying, or disclosure by any person other than the
>> addressee(s) is strictly prohibited. If you have received this email in
>> error, please notify the sender immediately by return email and delete the
>> message and any attachments from your system.
>>
>>
>>
>>
>>
>> CONFIDENTIALITY NOTICE
>> ======================
>> This email message and any attachments are for the exclusive use of the
>> intended recipient(s) and may contain confidential and privileged
>> information. Any unauthorized review, use, disclosure or distribution is
>> prohibited. If you are not the intended recipient, please contact the sender
>> by reply email and destroy all copies of the original message along with any
>> attachments, from your computer system. If you are the intended recipient,
>> please be advised that the content of this message is subject to access,
>> review and disclosure by the sender's Email System Administrator.
>>
>>
>>
>>
>>
>> --
>> Swarnim
>
>
>
>
> --
> Swarnim

Re: Very poor read performance with composite keys in hbase

Posted by "kulkarni.swarnim@gmail.com" <ku...@gmail.com>.
That depends on how dynamic your data is. If it is pretty static, you can
also consider using something like Create Table As Select (CTAS) to create
a snapshot of your data to HDFS and then run queries on top of that data.

So your query might become something like:

create table my_table as select * from event where key.name=’Signup’ and
key.dateCreated=’2013-03-06 16:39:55.353’ and key.uid=’7af4c330-5988-4255-
9250-924ce5864e3bf’;

Since your data is now in HDFS, this should give you a considerable
performance boost.


On Tue, Apr 30, 2013 at 3:00 PM, Rupinder Singh <rs...@care.com> wrote:

>  Swarnim,****
>
> ** **
>
> Thanks. So this means custom map reduce is the viable option when working
> with hbase tables having composite keys, since it allows to set the start
> and stop keys. Hive+Hbase combination is out.****
>
> ** **
>
> Regards****
>
> Rupinder****
>
> ** **
>
> *From:* kulkarni.swarnim@gmail.com [mailto:kulkarni.swarnim@gmail.com]
> *Sent:* Wednesday, May 01, 2013 12:17 AM
>
> *To:* user@hive.apache.org
> *Cc:* user@hbase.apache.org
> *Subject:* Re: Very poor read performance with composite keys in hbase****
>
>  ** **
>
> Rupinder,****
>
> ** **
>
> Hive supports a filter pushdown[1] which means that the predicates in the
> where clause are pushed down to the storage handler level where either they
> get handled by the storage handler or delegated to hive if they cannot
> handle them. As of now, the HBaseStorageHandler only supports primitive
> types. So when you use strings as keys, behind the scenes they get
> converted to start and stop keys and restrict the hbase scan. This does not
> happen for structs. Hence you see a full table scan causing bad performance.
> ****
>
> ** **
>
> [1] https://cwiki.apache.org/Hive/filterpushdowndev.html****
>
> ** **
>
> On Tue, Apr 30, 2013 at 1:04 PM, Sanjay Subramanian <
> Sanjay.Subramanian@wizecommerce.com> wrote:****
>
> My experience with hive + hbase has been about 8x slower on an average. So
> I went ahead with hive only option.
>
> Sent from my iPhone****
>
>
> On Apr 30, 2013, at 11:19 PM, "Rupinder Singh" <rs...@care.com> wrote:***
> *
>
>  Hi,****
>
>  ****
>
> I have an hbase cluster where I have a table with a composite key. I map
> this table to a Hive external table using which I insert/select data
> into/from this table:****
>
> CREATE EXTERNAL TABLE event(key
> struct<name:string,dateCreated:string,uid:string>, {more columns here})***
> *
>
> ROW FORMAT DELIMITED****
>
> COLLECTION ITEMS TERMINATED BY '~'****
>
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'****
>
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, other columns ")***
> *
>
> TBLPROPERTIES ("hbase.table.name" = "event");****
>
>  ****
>
> The table has about 10 million rows. When I do a select * using all 3
> components of the key, essentially selecting just 1 row, the response time
> is almost 700 sec, which seems pretty bad.****
>
>  ****
>
> For comparison purpose, I created another table with a simple string key,
> and the rest of the columns etc same. The key is a string UUID. Table has
> same number of column families and same number of rows.****
>
> CREATE EXTERNAL TABLE test_event(key string, blah blah…..****
>
> TBLPROPERTIES ("hbase.table.name" = "test_event");****
>
>  ****
>
> When I select a single row from this table by doing select * where
> key=’something’, the response time is 35 sec.****
>
>  ****
>
> This seems to indicate that in case of composite keys, there is a full
> table scan happening.  This seems weird.****
>
>  ****
>
> What am I missing here? Is there something special I need to do to get
> good read performance if I am using composite keys ?****
>
> Insert performance in both cases is comparable and is as per expectation.*
> ***
>
>  ****
>
> Any help is appreciated.****
>
> Here is the env spec:****
>
>  ****
>
> Amazon EMR****
>
> Hbase Cluster- 3 core nodes with 7.5 GB RAM each, 2 CPUs of 2.2 GHz each.
> Master 7.5 GB RAM, 2 CPUs of 2.2 GHz each****
>
> Hive Cluster – 3 core nodes 3.75 GB RAM each, 1 CPU of 1.8 GHz. Master
> 3.75 GB RAM, 1 CPU of 1.8 GHz****
>
>  ****
>
> Thanks****
>
> Rupinder****
>
> ** **
>
> ** **
>
> This email is intended for the person(s) to whom it is addressed and may
> contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized
> use, distribution, copying, or disclosure by any person other than the
> addressee(s) is strictly prohibited. If you have received this email in
> error, please notify the sender immediately by return email and delete the
> message and any attachments from your system.****
>
> ** **
>
> ** **
>
> CONFIDENTIALITY NOTICE
> ======================
> This email message and any attachments are for the exclusive use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution is
> prohibited. If you are not the intended recipient, please contact the
> sender by reply email and destroy all copies of the original message along
> with any attachments, from your computer system. If you are the intended
> recipient, please be advised that the content of this message is subject to
> access, review and disclosure by the sender's Email System Administrator.*
> ***
>
>
>
> ****
>
> ** **
>
> --
> Swarnim ****
>



-- 
Swarnim

Re: Very poor read performance with composite keys in hbase

Posted by "kulkarni.swarnim@gmail.com" <ku...@gmail.com>.
That depends on how dynamic your data is. If it is pretty static, you can
also consider using something like Create Table As Select (CTAS) to create
a snapshot of your data to HDFS and then run queries on top of that data.

So your query might become something like:

create table my_table as select * from event where key.name=’Signup’ and
key.dateCreated=’2013-03-06 16:39:55.353’ and key.uid=’7af4c330-5988-4255-
9250-924ce5864e3bf’;

Since your data is now in HDFS, this should give you a considerable
performance boost.


On Tue, Apr 30, 2013 at 3:00 PM, Rupinder Singh <rs...@care.com> wrote:

>  Swarnim,****
>
> ** **
>
> Thanks. So this means custom map reduce is the viable option when working
> with hbase tables having composite keys, since it allows to set the start
> and stop keys. Hive+Hbase combination is out.****
>
> ** **
>
> Regards****
>
> Rupinder****
>
> ** **
>
> *From:* kulkarni.swarnim@gmail.com [mailto:kulkarni.swarnim@gmail.com]
> *Sent:* Wednesday, May 01, 2013 12:17 AM
>
> *To:* user@hive.apache.org
> *Cc:* user@hbase.apache.org
> *Subject:* Re: Very poor read performance with composite keys in hbase****
>
>  ** **
>
> Rupinder,****
>
> ** **
>
> Hive supports a filter pushdown[1] which means that the predicates in the
> where clause are pushed down to the storage handler level where either they
> get handled by the storage handler or delegated to hive if they cannot
> handle them. As of now, the HBaseStorageHandler only supports primitive
> types. So when you use strings as keys, behind the scenes they get
> converted to start and stop keys and restrict the hbase scan. This does not
> happen for structs. Hence you see a full table scan causing bad performance.
> ****
>
> ** **
>
> [1] https://cwiki.apache.org/Hive/filterpushdowndev.html****
>
> ** **
>
> On Tue, Apr 30, 2013 at 1:04 PM, Sanjay Subramanian <
> Sanjay.Subramanian@wizecommerce.com> wrote:****
>
> My experience with hive + hbase has been about 8x slower on an average. So
> I went ahead with hive only option.
>
> Sent from my iPhone****
>
>
> On Apr 30, 2013, at 11:19 PM, "Rupinder Singh" <rs...@care.com> wrote:***
> *
>
>  Hi,****
>
>  ****
>
> I have an hbase cluster where I have a table with a composite key. I map
> this table to a Hive external table using which I insert/select data
> into/from this table:****
>
> CREATE EXTERNAL TABLE event(key
> struct<name:string,dateCreated:string,uid:string>, {more columns here})***
> *
>
> ROW FORMAT DELIMITED****
>
> COLLECTION ITEMS TERMINATED BY '~'****
>
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'****
>
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, other columns ")***
> *
>
> TBLPROPERTIES ("hbase.table.name" = "event");****
>
>  ****
>
> The table has about 10 million rows. When I do a select * using all 3
> components of the key, essentially selecting just 1 row, the response time
> is almost 700 sec, which seems pretty bad.****
>
>  ****
>
> For comparison purpose, I created another table with a simple string key,
> and the rest of the columns etc same. The key is a string UUID. Table has
> same number of column families and same number of rows.****
>
> CREATE EXTERNAL TABLE test_event(key string, blah blah…..****
>
> TBLPROPERTIES ("hbase.table.name" = "test_event");****
>
>  ****
>
> When I select a single row from this table by doing select * where
> key=’something’, the response time is 35 sec.****
>
>  ****
>
> This seems to indicate that in case of composite keys, there is a full
> table scan happening.  This seems weird.****
>
>  ****
>
> What am I missing here? Is there something special I need to do to get
> good read performance if I am using composite keys ?****
>
> Insert performance in both cases is comparable and is as per expectation.*
> ***
>
>  ****
>
> Any help is appreciated.****
>
> Here is the env spec:****
>
>  ****
>
> Amazon EMR****
>
> Hbase Cluster- 3 core nodes with 7.5 GB RAM each, 2 CPUs of 2.2 GHz each.
> Master 7.5 GB RAM, 2 CPUs of 2.2 GHz each****
>
> Hive Cluster – 3 core nodes 3.75 GB RAM each, 1 CPU of 1.8 GHz. Master
> 3.75 GB RAM, 1 CPU of 1.8 GHz****
>
>  ****
>
> Thanks****
>
> Rupinder****
>
> ** **
>
> ** **
>
> This email is intended for the person(s) to whom it is addressed and may
> contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized
> use, distribution, copying, or disclosure by any person other than the
> addressee(s) is strictly prohibited. If you have received this email in
> error, please notify the sender immediately by return email and delete the
> message and any attachments from your system.****
>
> ** **
>
> ** **
>
> CONFIDENTIALITY NOTICE
> ======================
> This email message and any attachments are for the exclusive use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution is
> prohibited. If you are not the intended recipient, please contact the
> sender by reply email and destroy all copies of the original message along
> with any attachments, from your computer system. If you are the intended
> recipient, please be advised that the content of this message is subject to
> access, review and disclosure by the sender's Email System Administrator.*
> ***
>
>
>
> ****
>
> ** **
>
> --
> Swarnim ****
>



-- 
Swarnim

RE: Very poor read performance with composite keys in hbase

Posted by Rupinder Singh <rs...@care.com>.
Swarnim,

Thanks. So this means custom map reduce is the viable option when working with hbase tables having composite keys, since it allows to set the start and stop keys. Hive+Hbase combination is out.

Regards
Rupinder

From: kulkarni.swarnim@gmail.com [mailto:kulkarni.swarnim@gmail.com]
Sent: Wednesday, May 01, 2013 12:17 AM
To: user@hive.apache.org
Cc: user@hbase.apache.org
Subject: Re: Very poor read performance with composite keys in hbase

Rupinder,

Hive supports a filter pushdown[1] which means that the predicates in the where clause are pushed down to the storage handler level where either they get handled by the storage handler or delegated to hive if they cannot handle them. As of now, the HBaseStorageHandler only supports primitive types. So when you use strings as keys, behind the scenes they get converted to start and stop keys and restrict the hbase scan. This does not happen for structs. Hence you see a full table scan causing bad performance.

[1] https://cwiki.apache.org/Hive/filterpushdowndev.html

On Tue, Apr 30, 2013 at 1:04 PM, Sanjay Subramanian <Sa...@wizecommerce.com>> wrote:
My experience with hive + hbase has been about 8x slower on an average. So I went ahead with hive only option.

Sent from my iPhone

On Apr 30, 2013, at 11:19 PM, "Rupinder Singh" <rs...@care.com>> wrote:
Hi,

I have an hbase cluster where I have a table with a composite key. I map this table to a Hive external table using which I insert/select data into/from this table:
CREATE EXTERNAL TABLE event(key struct<name:string,dateCreated:string,uid:string>, {more columns here})
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '~'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, other columns ")
TBLPROPERTIES ("hbase.table.name<http://hbase.table.name>" = "event");

The table has about 10 million rows. When I do a select * using all 3 components of the key, essentially selecting just 1 row, the response time is almost 700 sec, which seems pretty bad.

For comparison purpose, I created another table with a simple string key, and the rest of the columns etc same. The key is a string UUID. Table has same number of column families and same number of rows.
CREATE EXTERNAL TABLE test_event(key string, blah blah.....
TBLPROPERTIES ("hbase.table.name<http://hbase.table.name>" = "test_event");

When I select a single row from this table by doing select * where key='something', the response time is 35 sec.

This seems to indicate that in case of composite keys, there is a full table scan happening.  This seems weird.

What am I missing here? Is there something special I need to do to get good read performance if I am using composite keys ?
Insert performance in both cases is comparable and is as per expectation.

Any help is appreciated.
Here is the env spec:

Amazon EMR
Hbase Cluster- 3 core nodes with 7.5 GB RAM each, 2 CPUs of 2.2 GHz each. Master 7.5 GB RAM, 2 CPUs of 2.2 GHz each
Hive Cluster - 3 core nodes 3.75 GB RAM each, 1 CPU of 1.8 GHz. Master 3.75 GB RAM, 1 CPU of 1.8 GHz

Thanks
Rupinder



This email is intended for the person(s) to whom it is addressed and may contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized use, distribution, copying, or disclosure by any person other than the addressee(s) is strictly prohibited. If you have received this email in error, please notify the sender immediately by return email and delete the message and any attachments from your system.


CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.



--
Swarnim

RE: Very poor read performance with composite keys in hbase

Posted by Rupinder Singh <rs...@care.com>.
Swarnim,

Thanks. So this means custom map reduce is the viable option when working with hbase tables having composite keys, since it allows to set the start and stop keys. Hive+Hbase combination is out.

Regards
Rupinder

From: kulkarni.swarnim@gmail.com [mailto:kulkarni.swarnim@gmail.com]
Sent: Wednesday, May 01, 2013 12:17 AM
To: user@hive.apache.org
Cc: user@hbase.apache.org
Subject: Re: Very poor read performance with composite keys in hbase

Rupinder,

Hive supports a filter pushdown[1] which means that the predicates in the where clause are pushed down to the storage handler level where either they get handled by the storage handler or delegated to hive if they cannot handle them. As of now, the HBaseStorageHandler only supports primitive types. So when you use strings as keys, behind the scenes they get converted to start and stop keys and restrict the hbase scan. This does not happen for structs. Hence you see a full table scan causing bad performance.

[1] https://cwiki.apache.org/Hive/filterpushdowndev.html

On Tue, Apr 30, 2013 at 1:04 PM, Sanjay Subramanian <Sa...@wizecommerce.com>> wrote:
My experience with hive + hbase has been about 8x slower on an average. So I went ahead with hive only option.

Sent from my iPhone

On Apr 30, 2013, at 11:19 PM, "Rupinder Singh" <rs...@care.com>> wrote:
Hi,

I have an hbase cluster where I have a table with a composite key. I map this table to a Hive external table using which I insert/select data into/from this table:
CREATE EXTERNAL TABLE event(key struct<name:string,dateCreated:string,uid:string>, {more columns here})
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '~'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, other columns ")
TBLPROPERTIES ("hbase.table.name<http://hbase.table.name>" = "event");

The table has about 10 million rows. When I do a select * using all 3 components of the key, essentially selecting just 1 row, the response time is almost 700 sec, which seems pretty bad.

For comparison purpose, I created another table with a simple string key, and the rest of the columns etc same. The key is a string UUID. Table has same number of column families and same number of rows.
CREATE EXTERNAL TABLE test_event(key string, blah blah.....
TBLPROPERTIES ("hbase.table.name<http://hbase.table.name>" = "test_event");

When I select a single row from this table by doing select * where key='something', the response time is 35 sec.

This seems to indicate that in case of composite keys, there is a full table scan happening.  This seems weird.

What am I missing here? Is there something special I need to do to get good read performance if I am using composite keys ?
Insert performance in both cases is comparable and is as per expectation.

Any help is appreciated.
Here is the env spec:

Amazon EMR
Hbase Cluster- 3 core nodes with 7.5 GB RAM each, 2 CPUs of 2.2 GHz each. Master 7.5 GB RAM, 2 CPUs of 2.2 GHz each
Hive Cluster - 3 core nodes 3.75 GB RAM each, 1 CPU of 1.8 GHz. Master 3.75 GB RAM, 1 CPU of 1.8 GHz

Thanks
Rupinder



This email is intended for the person(s) to whom it is addressed and may contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized use, distribution, copying, or disclosure by any person other than the addressee(s) is strictly prohibited. If you have received this email in error, please notify the sender immediately by return email and delete the message and any attachments from your system.


CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.



--
Swarnim

Re: Very poor read performance with composite keys in hbase

Posted by "kulkarni.swarnim@gmail.com" <ku...@gmail.com>.
Rupinder,

Hive supports a filter pushdown[1] which means that the predicates in the
where clause are pushed down to the storage handler level where either they
get handled by the storage handler or delegated to hive if they cannot
handle them. As of now, the HBaseStorageHandler only supports primitive
types. So when you use strings as keys, behind the scenes they get
converted to start and stop keys and restrict the hbase scan. This does not
happen for structs. Hence you see a full table scan causing bad performance.

[1] https://cwiki.apache.org/Hive/filterpushdowndev.html


On Tue, Apr 30, 2013 at 1:04 PM, Sanjay Subramanian <
Sanjay.Subramanian@wizecommerce.com> wrote:

>  My experience with hive + hbase has been about 8x slower on an average.
> So I went ahead with hive only option.
>
> Sent from my iPhone
>
> On Apr 30, 2013, at 11:19 PM, "Rupinder Singh" <rs...@care.com> wrote:
>
>   Hi,
>
>
>
> I have an hbase cluster where I have a table with a composite key. I map
> this table to a Hive external table using which I insert/select data
> into/from this table:
>
> CREATE EXTERNAL TABLE event(key
> struct<name:string,dateCreated:string,uid:string>, {more columns here})
>
> ROW FORMAT DELIMITED
>
> COLLECTION ITEMS TERMINATED BY '~'
>
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, other columns ")
>
> TBLPROPERTIES ("hbase.table.name" = "event");
>
>
>
> The table has about 10 million rows. When I do a select * using all 3
> components of the key, essentially selecting just 1 row, the response time
> is almost 700 sec, which seems pretty bad.
>
>
>
> For comparison purpose, I created another table with a simple string key,
> and the rest of the columns etc same. The key is a string UUID. Table has
> same number of column families and same number of rows.
>
> CREATE EXTERNAL TABLE test_event(key string, blah blah…..
>
> TBLPROPERTIES ("hbase.table.name" = "test_event");
>
>
>
> When I select a single row from this table by doing select * where
> key=’something’, the response time is 35 sec.
>
>
>
> This seems to indicate that in case of composite keys, there is a full
> table scan happening.  This seems weird.
>
>
>
> What am I missing here? Is there something special I need to do to get
> good read performance if I am using composite keys ?
>
> Insert performance in both cases is comparable and is as per expectation.
>
>
>
> Any help is appreciated.
>
> Here is the env spec:
>
>
>
> Amazon EMR
>
> Hbase Cluster- 3 core nodes with 7.5 GB RAM each, 2 CPUs of 2.2 GHz each.
> Master 7.5 GB RAM, 2 CPUs of 2.2 GHz each
>
> Hive Cluster – 3 core nodes 3.75 GB RAM each, 1 CPU of 1.8 GHz. Master
> 3.75 GB RAM, 1 CPU of 1.8 GHz
>
>
>
> Thanks
>
> Rupinder
>
>
> This email is intended for the person(s) to whom it is addressed and may
> contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized
> use, distribution, copying, or disclosure by any person other than the
> addressee(s) is strictly prohibited. If you have received this email in
> error, please notify the sender immediately by return email and delete the
> message and any attachments from your system.
>
>
> CONFIDENTIALITY NOTICE
> ======================
> This email message and any attachments are for the exclusive use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution is
> prohibited. If you are not the intended recipient, please contact the
> sender by reply email and destroy all copies of the original message along
> with any attachments, from your computer system. If you are the intended
> recipient, please be advised that the content of this message is subject to
> access, review and disclosure by the sender's Email System Administrator.
>



-- 
Swarnim

Re: Very poor read performance with composite keys in hbase

Posted by "kulkarni.swarnim@gmail.com" <ku...@gmail.com>.
Rupinder,

Hive supports a filter pushdown[1] which means that the predicates in the
where clause are pushed down to the storage handler level where either they
get handled by the storage handler or delegated to hive if they cannot
handle them. As of now, the HBaseStorageHandler only supports primitive
types. So when you use strings as keys, behind the scenes they get
converted to start and stop keys and restrict the hbase scan. This does not
happen for structs. Hence you see a full table scan causing bad performance.

[1] https://cwiki.apache.org/Hive/filterpushdowndev.html


On Tue, Apr 30, 2013 at 1:04 PM, Sanjay Subramanian <
Sanjay.Subramanian@wizecommerce.com> wrote:

>  My experience with hive + hbase has been about 8x slower on an average.
> So I went ahead with hive only option.
>
> Sent from my iPhone
>
> On Apr 30, 2013, at 11:19 PM, "Rupinder Singh" <rs...@care.com> wrote:
>
>   Hi,
>
>
>
> I have an hbase cluster where I have a table with a composite key. I map
> this table to a Hive external table using which I insert/select data
> into/from this table:
>
> CREATE EXTERNAL TABLE event(key
> struct<name:string,dateCreated:string,uid:string>, {more columns here})
>
> ROW FORMAT DELIMITED
>
> COLLECTION ITEMS TERMINATED BY '~'
>
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
>
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, other columns ")
>
> TBLPROPERTIES ("hbase.table.name" = "event");
>
>
>
> The table has about 10 million rows. When I do a select * using all 3
> components of the key, essentially selecting just 1 row, the response time
> is almost 700 sec, which seems pretty bad.
>
>
>
> For comparison purpose, I created another table with a simple string key,
> and the rest of the columns etc same. The key is a string UUID. Table has
> same number of column families and same number of rows.
>
> CREATE EXTERNAL TABLE test_event(key string, blah blah…..
>
> TBLPROPERTIES ("hbase.table.name" = "test_event");
>
>
>
> When I select a single row from this table by doing select * where
> key=’something’, the response time is 35 sec.
>
>
>
> This seems to indicate that in case of composite keys, there is a full
> table scan happening.  This seems weird.
>
>
>
> What am I missing here? Is there something special I need to do to get
> good read performance if I am using composite keys ?
>
> Insert performance in both cases is comparable and is as per expectation.
>
>
>
> Any help is appreciated.
>
> Here is the env spec:
>
>
>
> Amazon EMR
>
> Hbase Cluster- 3 core nodes with 7.5 GB RAM each, 2 CPUs of 2.2 GHz each.
> Master 7.5 GB RAM, 2 CPUs of 2.2 GHz each
>
> Hive Cluster – 3 core nodes 3.75 GB RAM each, 1 CPU of 1.8 GHz. Master
> 3.75 GB RAM, 1 CPU of 1.8 GHz
>
>
>
> Thanks
>
> Rupinder
>
>
> This email is intended for the person(s) to whom it is addressed and may
> contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized
> use, distribution, copying, or disclosure by any person other than the
> addressee(s) is strictly prohibited. If you have received this email in
> error, please notify the sender immediately by return email and delete the
> message and any attachments from your system.
>
>
> CONFIDENTIALITY NOTICE
> ======================
> This email message and any attachments are for the exclusive use of the
> intended recipient(s) and may contain confidential and privileged
> information. Any unauthorized review, use, disclosure or distribution is
> prohibited. If you are not the intended recipient, please contact the
> sender by reply email and destroy all copies of the original message along
> with any attachments, from your computer system. If you are the intended
> recipient, please be advised that the content of this message is subject to
> access, review and disclosure by the sender's Email System Administrator.
>



-- 
Swarnim

Re: Very poor read performance with composite keys in hbase

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
My experience with hive + hbase has been about 8x slower on an average. So I went ahead with hive only option.

Sent from my iPhone

On Apr 30, 2013, at 11:19 PM, "Rupinder Singh" <rs...@care.com>> wrote:

Hi,

I have an hbase cluster where I have a table with a composite key. I map this table to a Hive external table using which I insert/select data into/from this table:
CREATE EXTERNAL TABLE event(key struct<name:string,dateCreated:string,uid:string>, {more columns here})
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '~'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, other columns ")
TBLPROPERTIES ("hbase.table.name" = "event");

The table has about 10 million rows. When I do a select * using all 3 components of the key, essentially selecting just 1 row, the response time is almost 700 sec, which seems pretty bad.

For comparison purpose, I created another table with a simple string key, and the rest of the columns etc same. The key is a string UUID. Table has same number of column families and same number of rows.
CREATE EXTERNAL TABLE test_event(key string, blah blah…..
TBLPROPERTIES ("hbase.table.name" = "test_event");

When I select a single row from this table by doing select * where key=’something’, the response time is 35 sec.

This seems to indicate that in case of composite keys, there is a full table scan happening.  This seems weird.

What am I missing here? Is there something special I need to do to get good read performance if I am using composite keys ?
Insert performance in both cases is comparable and is as per expectation.

Any help is appreciated.
Here is the env spec:

Amazon EMR
Hbase Cluster- 3 core nodes with 7.5 GB RAM each, 2 CPUs of 2.2 GHz each. Master 7.5 GB RAM, 2 CPUs of 2.2 GHz each
Hive Cluster – 3 core nodes 3.75 GB RAM each, 1 CPU of 1.8 GHz. Master 3.75 GB RAM, 1 CPU of 1.8 GHz

Thanks
Rupinder



This email is intended for the person(s) to whom it is addressed and may contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized use, distribution, copying, or disclosure by any person other than the addressee(s) is strictly prohibited. If you have received this email in error, please notify the sender immediately by return email and delete the message and any attachments from your system.


CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.

Re: Very poor read performance with composite keys in hbase

Posted by Sanjay Subramanian <Sa...@wizecommerce.com>.
My experience with hive + hbase has been about 8x slower on an average. So I went ahead with hive only option.

Sent from my iPhone

On Apr 30, 2013, at 11:19 PM, "Rupinder Singh" <rs...@care.com>> wrote:

Hi,

I have an hbase cluster where I have a table with a composite key. I map this table to a Hive external table using which I insert/select data into/from this table:
CREATE EXTERNAL TABLE event(key struct<name:string,dateCreated:string,uid:string>, {more columns here})
ROW FORMAT DELIMITED
COLLECTION ITEMS TERMINATED BY '~'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, other columns ")
TBLPROPERTIES ("hbase.table.name" = "event");

The table has about 10 million rows. When I do a select * using all 3 components of the key, essentially selecting just 1 row, the response time is almost 700 sec, which seems pretty bad.

For comparison purpose, I created another table with a simple string key, and the rest of the columns etc same. The key is a string UUID. Table has same number of column families and same number of rows.
CREATE EXTERNAL TABLE test_event(key string, blah blah…..
TBLPROPERTIES ("hbase.table.name" = "test_event");

When I select a single row from this table by doing select * where key=’something’, the response time is 35 sec.

This seems to indicate that in case of composite keys, there is a full table scan happening.  This seems weird.

What am I missing here? Is there something special I need to do to get good read performance if I am using composite keys ?
Insert performance in both cases is comparable and is as per expectation.

Any help is appreciated.
Here is the env spec:

Amazon EMR
Hbase Cluster- 3 core nodes with 7.5 GB RAM each, 2 CPUs of 2.2 GHz each. Master 7.5 GB RAM, 2 CPUs of 2.2 GHz each
Hive Cluster – 3 core nodes 3.75 GB RAM each, 1 CPU of 1.8 GHz. Master 3.75 GB RAM, 1 CPU of 1.8 GHz

Thanks
Rupinder



This email is intended for the person(s) to whom it is addressed and may contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized use, distribution, copying, or disclosure by any person other than the addressee(s) is strictly prohibited. If you have received this email in error, please notify the sender immediately by return email and delete the message and any attachments from your system.


CONFIDENTIALITY NOTICE
======================
This email message and any attachments are for the exclusive use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message along with any attachments, from your computer system. If you are the intended recipient, please be advised that the content of this message is subject to access, review and disclosure by the sender's Email System Administrator.

Re: Very poor read performance with composite keys in hbase

Posted by "kulkarni.swarnim@gmail.com" <ku...@gmail.com>.
Can you show your query that is taking 700 seconds?


On Tue, Apr 30, 2013 at 12:48 PM, Rupinder Singh <rs...@care.com> wrote:

>  Hi,****
>
> ** **
>
> I have an hbase cluster where I have a table with a composite key. I map
> this table to a Hive external table using which I insert/select data
> into/from this table:****
>
> CREATE EXTERNAL TABLE event(key
> struct<name:string,dateCreated:string,uid:string>, {more columns here})***
> *
>
> ROW FORMAT DELIMITED****
>
> COLLECTION ITEMS TERMINATED BY '~'****
>
> STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'****
>
> WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, other columns ")***
> *
>
> TBLPROPERTIES ("hbase.table.name" = "event");****
>
> ** **
>
> The table has about 10 million rows. When I do a select * using all 3
> components of the key, essentially selecting just 1 row, the response time
> is almost 700 sec, which seems pretty bad.****
>
> ** **
>
> For comparison purpose, I created another table with a simple string key,
> and the rest of the columns etc same. The key is a string UUID. Table has
> same number of column families and same number of rows.****
>
> CREATE EXTERNAL TABLE test_event(key string, blah blah…..****
>
> TBLPROPERTIES ("hbase.table.name" = "test_event");****
>
> ** **
>
> When I select a single row from this table by doing select * where
> key=’something’, the response time is 35 sec.****
>
> ** **
>
> This seems to indicate that in case of composite keys, there is a full
> table scan happening.  This seems weird.****
>
> ** **
>
> What am I missing here? Is there something special I need to do to get
> good read performance if I am using composite keys ?****
>
> Insert performance in both cases is comparable and is as per expectation.*
> ***
>
> ** **
>
> Any help is appreciated.****
>
> Here is the env spec:****
>
> ** **
>
> Amazon EMR****
>
> Hbase Cluster- 3 core nodes with 7.5 GB RAM each, 2 CPUs of 2.2 GHz each.
> Master 7.5 GB RAM, 2 CPUs of 2.2 GHz each****
>
> Hive Cluster – 3 core nodes 3.75 GB RAM each, 1 CPU of 1.8 GHz. Master
> 3.75 GB RAM, 1 CPU of 1.8 GHz****
>
> ** **
>
> Thanks****
>
> Rupinder****
>
>
> This email is intended for the person(s) to whom it is addressed and may
> contain information that is PRIVILEGED or CONFIDENTIAL. Any unauthorized
> use, distribution, copying, or disclosure by any person other than the
> addressee(s) is strictly prohibited. If you have received this email in
> error, please notify the sender immediately by return email and delete the
> message and any attachments from your system.
>
>


-- 
Swarnim