Posted to user@hbase.apache.org by java8964 <ja...@hotmail.com> on 2014/02/10 19:56:20 UTC

Hive + Hbase scanning performance

Hi, 
I know this has been asked before. I googled the topic and tried to understand as much as possible, but I got different answers from different places. So I would like to describe what I ran into and ask whether someone can help me with this topic again.
I created one table with one column family and 20+ columns in Hive. It is populated with around 150M records from a 20 GB CSV file. What I want to check is how fast a full scan of the HBase table can run in an MR job.
It is running on a 10-node Hadoop cluster (Hadoop 1.1.1 + HBase 0.94.3 + Hive 0.9): 8 nodes are Data + Task nodes, one is the NN and HBase master, and another one runs the 2nd NN.
4 of the 8 data nodes also run HBase region servers.
I used the following code example to get a row count from an MR job: http://hbase.apache.org/book/mapreduce.example.html
At first, the mapper tasks ran very slowly, because I had commented out the following 2 lines on purpose:
scan.setCaching(1000);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
Then I added the above 2 lines back, and the job ran almost 10X faster than the first run. That's good; it proved to me that these 2 lines are important for an HBase full scan.
Now the question turns to Hive.
I had already created the table in Hive linked to the HBase table, then I started my Hive session like this:
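A rough way to see why the caching value matters: scanner caching is the number of rows the client fetches per next() round trip to the region server. The sketch below is plain arithmetic, not HBase code. The class and method names are hypothetical, and it ignores region boundaries and per-row payload size.

```java
public class ScanRoundTrips {
    // Number of scanner round trips needed to fetch `rows` rows when each
    // round trip returns at most `caching` rows (ceiling division).
    static long roundTrips(long rows, int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive");
        }
        return (rows + caching - 1) / caching;
    }

    public static void main(String[] args) {
        long rows = 150000000L; // the ~150M rows mentioned in this message
        System.out.println(roundTrips(rows, 1));    // prints 150000000
        System.out.println(roundTrips(rows, 1000)); // prints 150000
    }
}
```

With caching left at 1, every row costs one round trip. With 1000, the same scan needs three orders of magnitude fewer, which is at least consistent with the large speedup reported here.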
hive --auxpath $HIVE_HOME/lib/hive-hbase-handler-0.9.0.jar,$HIVE_HOME/lib/hbase-0.94.3.jar,$HIVE_HOME/lib/zookeeper-3.4.5.jar,$HIVE_HOME/lib/guava-r09.jar -hiveconf hbase.master=Hbase_master:port
If I run the query "select count(*) from table", I can see that the mappers' performance is very bad, almost as bad as my 1st run above.
I searched this mailing list, and it looks like there is a setting in the Hive session to change the scan caching size, equivalent to the 1st line of the code above, from here:
http://mail-archives.apache.org/mod_mbox/hbase-user/201110.mbox/%3CCAGpTDNfn11jZAJ2mfboEqkfudXaU9HGsY4b=2x1spWf4qMUvyw@mail.gmail.com%3E
So I added the following setting in my Hive session:
set hbase.client.scanner.caching=1000;
To my surprise, after this setting in the Hive session, the new MR job generated from the Hive query was still very slow, the same as before the setting.
Here is what I found so far:
1) In my own MR code, both before and after adding the 2 lines of code, I saw this setting in the job.xml of the MR job: hbase.client.scanner.caching=1. So this setting is the same in both runs, but the performance improved greatly after the code change.
2) In the Hive run, I saw the setting "hbase.client.scanner.caching" change from 1 to 1000 in job.xml, which is what I set in the Hive session, but the performance did not change much. So the setting was changed, but it didn't help the performance as I expected.
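One possible reading of finding 1, offered as an assumption rather than a confirmed reading of the 0.94.3 source: a caching value set on the Scan object travels with the scan itself and takes precedence over the configuration key, so job.xml can still show 1 while the scanner actually uses 1000. The sketch below is a plain Java model of that precedence rule. EffectiveCaching and effectiveCaching are hypothetical names, not HBase classes, and a non-positive scanCaching stands for "not set on the Scan".

```java
import java.util.HashMap;
import java.util.Map;

public class EffectiveCaching {
    static final String KEY = "hbase.client.scanner.caching";

    // Assumed rule: a positive value set on the scan wins; otherwise fall
    // back to the configuration key, and to 1 if the key is absent.
    static int effectiveCaching(int scanCaching, Map<String, String> conf) {
        if (scanCaching > 0) {
            return scanCaching;
        }
        String value = conf.get(KEY);
        return value == null ? 1 : Integer.parseInt(value);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<String, String>();
        conf.put(KEY, "1"); // what job.xml showed in the hand-written MR job

        System.out.println(effectiveCaching(1000, conf)); // prints 1000
        System.out.println(effectiveCaching(-1, conf));   // prints 1
    }
}
```

This model would only account for the first finding. It does not explain why the Hive job stays slow after the configuration key is raised to 1000.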
My questions are as follows:
1) Is there any setting in Hive (0.9) that does the same as the 1st line of the code change? From Google and the HBase documentation, it looks like the above configuration is the one, but it didn't help me.
2) Even assuming the above setting is correct, why do we have this Hive Jira to fix the HBase scan cache, marked as fixed ONLY in Hive 0.12? The Jira ticket is here: https://issues.apache.org/jira/browse/HIVE-3603
3) Is there any Hive setting that can do the same as the 2nd line of the code change above? If so, what is it? I googled around and cannot find one.
Thanks
Yong

Re: Hive + Hbase scanning performance

Posted by Ted Yu <yu...@gmail.com>.
You can patch HIVE-3603 into your deployment so that you can make use of
scan.setCacheBlocks(false).

Cheers

