Posted to user@hive.apache.org by Biju Kaimal <bi...@kaimal.net> on 2011/03/08 06:59:56 UTC

Performance between Hive queries vs. Hive over HBase queries

Hi,

I loaded a data set of 1 million rows into both a Hive table and an HBase
table. For the HBase table, I created a corresponding Hive table so that
the data in HBase can be queried with HiveQL. Both tables have a key column
and a value column.

For the same query (select value, count(*) from table group by value), the
Hive-only query runs much faster (~30 seconds) than the Hive over HBase
query (~150 seconds).

Is this expected?

Regards,
Biju
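
For readers following along, the setup described above might look roughly like the HiveQL below. This is only a sketch under assumptions: the table names (kv_native, kv_hbase), the HBase table name 'kv', and the column family and qualifier 'cf:val' are all hypothetical and do not come from the original post. The storage handler class and the hbase.columns.mapping / hbase.table.name properties are the ones used by Hive's HBase integration.

```sql
-- Hypothetical names throughout: kv_native, kv_hbase, the HBase table 'kv'
-- and the column 'cf:val' are illustrative, not taken from the post.

-- Native Hive table: its rows are stored in files on HDFS.
CREATE TABLE kv_native (key STRING, value STRING);

-- Hive table over an existing HBase table, through the HBase storage handler.
-- ':key' maps the HBase row key; 'cf:val' maps one column in family 'cf'.
CREATE EXTERNAL TABLE kv_hbase (key STRING, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:val")
TBLPROPERTIES ("hbase.table.name" = "kv");

-- The same aggregate, run once against each table.
SELECT value, count(*) FROM kv_native GROUP BY value;
SELECT value, count(*) FROM kv_hbase GROUP BY value;
```

Timing the two SELECT statements on the same cluster is the kind of comparison that produced the ~30 second versus ~150 second figures reported here.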

Re: Performance between Hive queries vs. Hive over HBase queries

Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, Mar 9, 2011 at 4:31 PM, John Sichi <js...@fb.com> wrote:
> Factor of 5 closely matches the results I got when I was testing.
>
> JVS
>
>
There is going to be overhead: the data has to move
HDFS->RegionServer->TaskTracker. Another factor is how many
column families the scan of your table spans.
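
To make the column family point concrete, here is a rough HiveQL sketch of two Hive tables defined over the same HBase table, one mapping two whole column families and one mapping a single column. Everything named here is an assumption for illustration: the HBase table 'events', the families 'd' and 'meta', the qualifier 'val', and the Hive table names. Whether the narrower mapping actually reduces the work done by the region servers depends on the Hive and HBase versions in use, so treat it as something to measure rather than a guarantee.

```sql
-- Hypothetical HBase table 'events' with two column families, 'd' and 'meta'.

-- Wide mapping: each whole family is exposed to Hive as a MAP of
-- qualifier -> value, so queries over this table can touch both families.
CREATE EXTERNAL TABLE events_wide (
  key  STRING,
  d    MAP<STRING,STRING>,
  meta MAP<STRING,STRING>
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:,meta:")
TBLPROPERTIES ("hbase.table.name" = "events");

-- Narrow mapping: only the row key and one column in one family.
CREATE EXTERNAL TABLE events_narrow (key STRING, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:val")
TBLPROPERTIES ("hbase.table.name" = "events");
```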

Re: Performance between Hive queries vs. Hive over HBase queries

Posted by John Sichi <js...@fb.com>.
Factor of 5 closely matches the results I got when I was testing.

JVS

On Mar 9, 2011, at 1:23 PM, Otis Gospodnetic wrote:

> Hi,
> 
> Biju's example shows a factor of 5 decrease in performance when Hive points to 
> HBase tables.
> 
> Does anyone know how much this factor varies?  Is it often closer to 1, or is it
> more often close to 10?
> Just trying to get a better feel for this...
> 
> Thanks,
> Otis


Re: Performance between Hive queries vs. Hive over HBase queries

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

Biju's example shows a factor of 5 decrease in performance when Hive points to 
HBase tables.

Does anyone know how much this factor varies?  Is it often closer to 1, or is it
more often close to 10?
Just trying to get a better feel for this...

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: Performance between Hive queries vs. Hive over HBase queries

Posted by John Sichi <js...@fb.com>.
There's one here specifically for the Hive portion, but really a full-stack system profile is needed for deciding where to attack it:

https://issues.apache.org/jira/browse/HIVE-1231

I don't know of anyone currently working in this area.

JVS

On Mar 8, 2011, at 9:51 PM, Otis Gospodnetic wrote:

> Hi,
> 
> John, are you or anybody else working on this particular performance hit?
> Are there plans or specific JIRA issues that those of us interested in
> performance improvements when Hive points to external tables in HBase
> should watch?
> 
> Thanks,
> Otis


Re: Performance between Hive queries vs. Hive over HBase queries

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

John, are you or anybody else working on this particular performance hit?
Are there plans or specific JIRA issues that those of us interested in
performance improvements when Hive points to external tables in HBase
should watch?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: Performance between Hive queries vs. Hive over HBase queries

Posted by John Sichi <js...@fb.com>.
For native tables, Hive reads rows directly from HDFS.

For HBase tables, it has to go through the HBase region servers, which reconstruct rows from column families (combining cache + HDFS).

HBase makes it possible to keep your table up to date in real time, but you have to pay an overhead cost at query time.

On the other hand, with native Hive tables, there's latency in loading new batches of data.

JVS

On Mar 7, 2011, at 10:13 PM, Biju Kaimal wrote:

> Hi,
> 
> Could you please explain the reason for the behavior? 
> 
> Regards,
> Biju


Re: Performance between Hive queries vs. Hive over HBase queries

Posted by Biju Kaimal <bi...@kaimal.net>.
Hi,

Could you please explain the reason for the behavior?

Regards,
Biju

On Tue, Mar 8, 2011 at 11:35 AM, John Sichi <js...@fb.com> wrote:

> Yes.
>
> JVS

Re: Performance between Hive queries vs. Hive over HBase queries

Posted by John Sichi <js...@fb.com>.
Yes.

JVS

On Mar 7, 2011, at 9:59 PM, Biju Kaimal wrote:

> Hi,
> 
> I loaded a data set of 1 million rows into both a Hive table and an HBase table. For the HBase table, I created a corresponding Hive table so that the data in HBase can be queried with HiveQL. Both tables have a key column and a value column.
> 
> For the same query (select value, count(*) from table group by value), the Hive-only query runs much faster (~30 seconds) than the Hive over HBase query (~150 seconds).
> 
> Is this expected?
> 
> Regards,
> Biju