Posted to user@hbase.apache.org by Bruce Bian <we...@gmail.com> on 2011/12/21 08:14:57 UTC

About performance issue of Hive/HBase vs Hive/HDFS

Hi there,
I read these two posts on the mailing list:
http://search-hadoop.com/m/nVaw59rFlY1/Performance+between+Hive+queries+vs.+Hive+over+HBase+queries&subj=Performance+between+Hive+queries+vs+Hive+over+HBase+queries
http://search-hadoop.com/m/X1rzQ1QDSaf2/Hive%252BHBase+performance+is+much+poorer+than+Hive%252BHDFS&subj=Hive+HBase+performance+is+much+poorer+than+Hive+HDFS
It seems that a 4~5X slowdown of Hive/HBase vs Hive/HDFS is expected,
because HBase adds another layer on top of HDFS. If that is the issue
here, is it possible to bypass the HBase layer and read the HFiles stored
on HDFS directly?
Another possibility may be that, for the same table, the storage is much
larger in HBase (around 5X in my test case, both uncompressed) than in
Hive, because HBase stores a separate KV pair for each column, so the row
key is repeated once per column. But after I compressed the HBase table
with LZO (it is now nearly the same size as the uncompressed Hive table),
there was no performance gain for queries like select count(*) from xtable;
Is anyone working on this? I'm not sure whether I should post this to
Hive's mailing list, but there seems to be no progress on issues like
https://issues.apache.org/jira/browse/HIVE-1231

Regards,
Bruce
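
The key-repetition effect described above can be estimated with a
back-of-the-envelope calculation. The sketch below is only a rough model:
it counts the fixed per-KeyValue fields (key length, value length, row
length, family length, timestamp, type) plus the row key, family,
qualifier and value, and compares that with a delimited text row. The
function names and all example sizes are hypothetical, and the model
ignores HFile block indexes, compression and any other file-level overhead.

```python
# Rough size model for HBase KeyValues versus a delimited text row.
# All sizes in the example at the bottom are hypothetical, not measurements.

# Fixed bytes per KeyValue: key length (4) + value length (4)
# + row length (2) + family length (1) + timestamp (8) + type (1).
KEYVALUE_FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def keyvalue_size(row_key_len, family_len, qualifier_len, value_len):
    """Approximate serialized size in bytes of one KeyValue."""
    return (KEYVALUE_FIXED_OVERHEAD + row_key_len + family_len
            + qualifier_len + value_len)


def hbase_row_size(row_key_len, family_len, columns):
    """Size of one logical row stored as one KeyValue per column.

    columns is a list of (qualifier_len, value_len) pairs; the row key
    and family are repeated in every KeyValue.
    """
    return sum(keyvalue_size(row_key_len, family_len, q, v)
               for q, v in columns)


def delimited_row_size(row_key_len, columns):
    """Size of the same row as delimited text: the key and the values,
    plus one separator after each field (the last one is the newline)."""
    return row_key_len + sum(v for _, v in columns) + len(columns) + 1


if __name__ == "__main__":
    # Hypothetical row: 16-byte key, 2-byte family, ten columns with
    # 8-byte qualifiers and 8-byte values.
    columns = [(8, 8)] * 10
    hbase = hbase_row_size(16, 2, columns)
    text = delimited_row_size(16, columns)
    print(hbase, text, round(hbase / text, 2))
```

With these made-up sizes the per-column repetition alone accounts for a
multiple-fold difference, which is at least consistent with the kind of
gap reported above; shorter row keys, family names and qualifiers shrink it.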

Re: About performance issue of Hive/HBase vs Hive/HDFS

Posted by Andrew Purtell <ap...@apache.org>.
I suspect some significant overhead is due to the storage engine implementation.

Predicate pushdown from Hive to the Hive HBase storage engine is suboptimal. This causes HBase to do more work than is necessary to satisfy the Hive-side query. There is substantial work that could be done here to improve it. HIVE-1643 addressed some of this. You might ask Sandy Pratt about it.

There are a lot of conversions from byte[] to String and back; it would be good to reduce or eliminate these. HIVE-2380 can help, but Hive's HBase storage engine would need to take advantage of the new datatype (in Hive 0.9+).

I'm sure there is more here. I've only dabbled in this area.

Best regards,


   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)



Re: About performance issue of Hive/HBase vs Hive/HDFS

Posted by Michel Segel <mi...@hotmail.com>.
Hey Bruce, 
There's a bit more... 
Set the region size so as to minimize the number of regions; then there is tuning the GC and also setting up MSLAB (the MemStore-local allocation buffer).

You can check out Todd L.'s blog posts on Cloudera's website. I think he has covered both topics there.

Doug may have updated the HBase book too.

Sent from a remote device. Please excuse any typos...

Mike Segel
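
To put a number on the region-size suggestion above, here is a minimal
sketch of how the maximum region size bounds the region count for a table
of a given size. It assumes the limit is the one set through
hbase.hregion.max.filesize; the function name, the table size and both
thresholds are hypothetical, and real clusters will hold somewhat more
regions than this lower bound because regions split before they are full.

```python
import math


def estimated_regions(table_bytes, max_region_bytes):
    """Lower-bound estimate of the region count: the table size divided
    by the maximum region size, rounded up, and never less than one."""
    if max_region_bytes <= 0:
        raise ValueError("max_region_bytes must be positive")
    return max(1, math.ceil(table_bytes / max_region_bytes))


if __name__ == "__main__":
    GIB = 1024 ** 3
    MIB = 1024 ** 2
    table = 100 * GIB  # hypothetical table size
    # Compare a small and a large (both hypothetical) region size limit.
    for limit in (256 * MIB, 4 * GIB):
        print(limit // MIB, "MiB ->", estimated_regions(table, limit))
```

Fewer, larger regions mean fewer map tasks and less per-region bookkeeping
for a full-table scan, at the cost of coarser parallelism.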


Re: About performance issue of Hive/HBase vs Hive/HDFS

Posted by Bruce Bian <we...@gmail.com>.
Hi Michel,
Maybe I missed something, but that's what was said in those two posts, and
it also matches the results I've got so far from my own tests.
As for tuning HBase: after ensuring data locality, using scanner caching
and turning off block caching, what other configs should I pay attention
to? Any tips?
Yeah, I'm happy to give Snappy a shot.

Regards,
Bruce
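
As a rough illustration of why scanner caching matters for a full-table
scan such as select count(*): the caching value (Scan.setCaching in the
Java client) is the number of rows fetched per scanner RPC, so the number
of round trips falls roughly in proportion to it. The sketch below is a
simplification; the function name and row counts are hypothetical, and it
ignores the extra call that detects the end of the scan as well as any
per-region scanner setup.

```python
import math


def scanner_round_trips(total_rows, caching):
    """Approximate number of scanner RPCs needed to read total_rows
    when each RPC returns up to `caching` rows."""
    if caching < 1:
        raise ValueError("caching must be at least 1")
    return math.ceil(total_rows / caching)


if __name__ == "__main__":
    rows = 1_000_000  # hypothetical table size in rows
    for caching in (1, 100, 1000):
        print(caching, scanner_round_trips(rows, caching))
```

Larger caching values trade client and region server memory for fewer
round trips, so the useful upper limit depends on row size.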


Re: About performance issue of Hive/HBase vs Hive/HDFS

Posted by Michel Segel <mi...@hotmail.com>.
Ok... Just my random thoughts...
There definitely is overhead in HBase that doesn't exist when you are doing direct access against a Hive table. 4 to 5 times slower? I'd question how you tuned your HBase.

Having said that, I would imagine that there are still some potential improvements that could be made in Hive to work better with HBase.
Also, why LZO and not Snappy?


Sent from a remote device. Please excuse any typos...

Mike Segel
