Posted to user@hive.apache.org by java8964 <ja...@hotmail.com> on 2014/02/10 20:22:52 UTC

Hbase + Hive scan performance

Hi, 
I know this has been asked before. I googled around this topic and tried to understand as much as possible, but I got different answers from different places. So I would like to describe what I am facing and see if someone can help me on this topic.
I created one HBase table with one column family and 20+ columns. It is populated with around 150M records from a 20GB CSV file. What I want to check is how fast a full scan of the HBase table can run in an MR job.
It is running on a 10-node Hadoop cluster (Hadoop 1.1.1 + HBase 0.94.3 + Hive 0.9): 8 nodes are Data + Task nodes, one is the NN and HBase master, and another runs the 2nd NN.
4 of the 8 data nodes also run HBase region servers.
I used the following code example to get a row count from an MR job: http://hbase.apache.org/book/mapreduce.example.html
At first, the mapper tasks ran very slow, because I commented out the following 2 lines on purpose:
scan.setCaching(1000);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs
Then I added the above 2 lines back, and the job ran almost 10X faster compared to the first run. That's good; it proved to me that the above 2 lines are important for an HBase full scan.
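For reference, here is a minimal sketch of what that row-count job looks like with both lines in place. It follows the structure of the HBase book example linked above; the table name and class names are placeholders, and it needs a live cluster and the HBase/Hadoop jars on the classpath to run:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class RowCountJob {

  // Mapper that only bumps a counter for each row it receives.
  static class RowCounterMapper
      extends TableMapper<ImmutableBytesWritable, Result> {
    public static enum Counters { ROWS }

    @Override
    public void map(ImmutableBytesWritable row, Result value, Context context)
        throws IOException {
      context.getCounter(Counters.ROWS).increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration config = HBaseConfiguration.create();
    Job job = new Job(config, "rowcount");
    job.setJarByClass(RowCountJob.class);

    Scan scan = new Scan();
    scan.setCaching(1000);       // default is 1, which is very slow for a full scan
    scan.setCacheBlocks(false);  // don't pollute the region server block cache from an MR job

    // "my_table" is a placeholder for the real table name
    TableMapReduceUtil.initTableMapperJob(
        "my_table", scan, RowCounterMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```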
Now the question is how to do the same in Hive.
I already created a Hive table linked to the HBase table, then I started my hive session like this:
hive --auxpath $HIVE_HOME/lib/hive-hbase-handler-0.9.0.jar,$HIVE_HOME/lib/hbase-0.94.3.jar,$HIVE_HOME/lib/zookeeper-3.4.5.jar,$HIVE_HOME/lib/guava-r09.jar -hiveconf hbase.master=Hbase_master:port
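For context, the Hive-to-HBase mapping was created along these lines (table, column family, and column names here are placeholders, not the real schema):

```sql
CREATE EXTERNAL TABLE hive_table (key string, col1 string, col2 string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:col1,cf:col2")
TBLPROPERTIES ("hbase.table.name" = "my_table");
```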
If I run the query "select count(*) from table", the mapper performance is very bad, almost as bad as my 1st run above.
I searched this mailing list, and it looks like there is a setting in the Hive session to change the scan caching size, equivalent to the 1st line of the code above, from here:
http://mail-archives.apache.org/mod_mbox/hbase-user/201110.mbox/%3CCAGpTDNfn11jZAJ2mfboEqkfudXaU9HGsY4b=2x1spWf4qMUvyw@mail.gmail.com%3E
So I added the following setting in my hive session:
set hbase.client.scanner.caching=1000;
To my surprise, after this setting in the hive session, the new MR job generated from the Hive query is still very slow, the same as before the setting.
Here is what I found so far:
1) In my own MR code, both before and after the 2-line code change, I saw this setting in the job.xml of the MR job: hbase.client.scanner.caching=1. So the setting is the same in both runs, yet performance improved greatly after the code change.
2) In the Hive run, I saw the setting "hbase.client.scanner.caching" change from 1 to 1000 in job.xml, which is what I set in the hive session, but performance did not change much. So the setting was picked up, but it didn't help performance as I expected.
My questions are the following:
1) Is there any setting in Hive (0.9) that does the same as the 1st line of the code change? From Google and the HBase documentation, it looks like the above configuration is the one, but it didn't help me.
2) Even assuming the above setting is correct, why do we have a Hive JIRA to fix the HBase scan cache that is marked fixed ONLY in Hive 0.12? The JIRA ticket is here: https://issues.apache.org/jira/browse/HIVE-3603
3) Is there any Hive setting that does the same as the 2nd line of the code change above? If so, what is it? I googled around and cannot find one.
Thanks
Yong

Re: Hbase + Hive scan performance

Posted by Navis류승우 <na...@nexr.com>.
The HBase storage handler uses its own InputFormat, so
hbase.client.scanner.caching (which is used by hbase's TableInputFormat)
does not work. It might be configurable via HIVE-2906, with something like
"select empno, ename from hbase_emp ('hbase.scan.cache'='1000')". But I've
not tried it.

bq. Is there any change in the hive (0.9) do the same as..
There might not be one.

bq. why we have this Hive Jira to fix the Hbase scan cache and marked ONLY
fixed in Hive 0.12..
Sorry for that. Hive is still rapidly evolving, so maintenance releases
of older versions are generally not provided.

bq. hive setting can do the same as 2nd line code
It's configurable via "hbase.scan.cacheblock".
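Assuming the HIVE-3603 naming, the two Scan settings from the original post would map to a hive session along these lines (only honored once that fix is in, i.e. Hive 0.12+):

```sql
-- counterpart of scan.setCaching(1000)
set hbase.scan.cache=1000;
-- counterpart of scan.setCacheBlocks(false)
set hbase.scan.cacheblock=false;
```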

ps. In hindsight, the name of the configuration should have been identical
with that of hbase, but it's already done.

Thanks,


