Posted to dev@hive.apache.org by "kulkarni.swarnim@gmail.com" <ku...@gmail.com> on 2015/05/14 16:16:41 UTC

Re: Questions related to HBase general use

+ hive-dev

Thanks for your question. We have recently been busy adding quite a few
features on top of the Hive/HBase integration to make it more stable and
easier to use. We also gave a talk very recently at HBaseCon 2015 showing
off the latest improvements. Slides here [1]. As Jerry mentioned, if you
run a regular query from Hive on an HBase table with billions of rows, it
is going to be slow, as it triggers a full table scan. However, Hive has
smarts around filter pushdown: the attributes in a "where" clause are
pushed down and converted to scan ranges and filters to optimize the scan.
Plus, with the recent Hive on Spark work, I expect this integration to
benefit from that as well.
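
To make the pushdown concrete, here is a rough sketch (the table, column
family and key values are made up for illustration) of what a pushed-down
row key predicate amounts to on the HBase side: a query such as
SELECT * FROM hbase_table WHERE key >= 'a' AND key < 'b' is answered with
a bounded range scan instead of a full table scan. In plain HBase client
API terms:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PushdownSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "hbase_table");  // illustrative name
    // WHERE key >= 'a' AND key < 'b'  ->  a bounded range scan
    Scan scan = new Scan();
    scan.setStartRow(Bytes.toBytes("a"));   // inclusive lower bound
    scan.setStopRow(Bytes.toBytes("b"));    // exclusive upper bound
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      System.out.println(Bytes.toString(r.getRow()));
    }
    scanner.close();
    table.close();
  }
}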

That said, we here use this integration daily over billions of rows to run
hundreds of queries without any issues. Since you mentioned that you are
already a big consumer of Hive, I would highly recommend giving this a
spin and reporting back with whatever issues you face so we can work on
making it more stable.

Hope that helps.

Swarnim

[1]
https://docs.google.com/presentation/d/1K2A2NMsNbmKWuG02aUDxsLo0Lal0lhznYy8SB6HjC9U/edit#slide=id.p

On Wed, May 13, 2015 at 6:26 PM, Nick Dimiduk <nd...@gmail.com> wrote:

> + Swarnim, who's the expert on HBase/Hive integration.
>
> Yes, snapshots may be interesting for you. I believe Hive can access HBase
> timestamps, exposed as a "virtual" column. It applies to the whole row,
> however, not per cell.
>
> On Sun, May 10, 2015 at 9:14 PM, Jerry He <je...@gmail.com> wrote:
>
>> Hi, Yong
>>
>> You have a good understanding of the benefits of HBase already.
>> Generally speaking, HBase is suitable for real-time read/write access to
>> your big data set.
>> Regarding the HBase performance evaluation tool, the 'read' test uses
>> HBase 'get'. For 1M rows, the test issues 1M 'get's (and RPCs) to the
>> server. The 'scan' test scans the table and transfers the rows to the
>> client in batches (e.g. 100 rows at a time), so the whole test completes
>> in less time for the same number of rows.
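>>
>> To illustrate the difference in client API terms (just a sketch; the
>> table name 'TestTable' is the PE default, but the row keys and counts
>> below are made up): the 'read' test pays one round trip per row, while
>> the 'scan' test streams rows back in batches.
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.hbase.HBaseConfiguration;
>> import org.apache.hadoop.hbase.client.*;
>> import org.apache.hadoop.hbase.util.Bytes;
>>
>> public class GetVsScan {
>>   public static void main(String[] args) throws Exception {
>>     Configuration conf = HBaseConfiguration.create();
>>     HTable table = new HTable(conf, "TestTable");
>>
>>     // 'read' test style: one Get, i.e. one RPC, per row.
>>     for (long i = 0; i < 1000; i++) {
>>       Get get = new Get(Bytes.toBytes("row-" + i));
>>       Result r = table.get(get);   // round trip to the region server
>>     }
>>
>>     // 'scan' test style: rows are streamed back in batches.
>>     Scan scan = new Scan();
>>     scan.setCaching(100);          // rows transferred per RPC
>>     ResultScanner scanner = table.getScanner(scan);
>>     for (Result r : scanner) {
>>       // process the row
>>     }
>>     scanner.close();
>>     table.close();
>>   }
>> }
>>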
>> The Hive/HBase integration, as you said, needs more consideration.
>> 1) Performance. Hive accesses HBase via the HBase client API, which
>> involves going to the HBase servers for all the data access. This will
>> slow things down.
>>     There are a couple of things you can explore, e.g. Hive/HBase
>> snapshot integration, which provides direct access to the HBase HFiles
>> (see the first sketch after point 2 below).
>> 2) In your email, you are interested in HBase's capability of storing
>> multiple versions of data. You need to consider whether Hive supports
>> this HBase feature, i.e. gives you access to the multiple versions (see
>> the second sketch below). As far as I can remember, it does not fully.
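>>
>> Here is a rough sketch of direct snapshot access with the HBase client
>> API (using TableSnapshotScanner, which I believe is only in HBase
>> releases newer than 0.96; the snapshot name and restore directory below
>> are made up). It reads the HFiles from HDFS directly, with no region
>> server round trips:
>>
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.hbase.HBaseConfiguration;
>> import org.apache.hadoop.hbase.client.Result;
>> import org.apache.hadoop.hbase.client.Scan;
>> import org.apache.hadoop.hbase.client.TableSnapshotScanner;
>>
>> public class SnapshotRead {
>>   public static void main(String[] args) throws Exception {
>>     Configuration conf = HBaseConfiguration.create();
>>     // Scratch dir where the snapshot references are restored.
>>     Path restoreDir = new Path("/tmp/snapshot-restore");
>>     Scan scan = new Scan();
>>     TableSnapshotScanner scanner =
>>         new TableSnapshotScanner(conf, restoreDir, "my_snapshot", scan);
>>     for (Result r : scanner) {
>>       // process the row, read straight from the HFiles
>>     }
>>     scanner.close();
>>   }
>> }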
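>>
>> And for completeness, retrieving multiple versions through the HBase
>> client itself is simple (table, column family and qualifier names below
>> are made up); the question is how much of this the Hive column mapping
>> actually exposes:
>>
>> import java.util.List;
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.hbase.Cell;
>> import org.apache.hadoop.hbase.CellUtil;
>> import org.apache.hadoop.hbase.HBaseConfiguration;
>> import org.apache.hadoop.hbase.client.Get;
>> import org.apache.hadoop.hbase.client.HTable;
>> import org.apache.hadoop.hbase.client.Result;
>> import org.apache.hadoop.hbase.util.Bytes;
>>
>> public class MultiVersionGet {
>>   public static void main(String[] args) throws Exception {
>>     Configuration conf = HBaseConfiguration.create();
>>     HTable table = new HTable(conf, "dim_table");
>>     Get get = new Get(Bytes.toBytes("row-key-1"));
>>     get.setMaxVersions(10);              // ask for up to 10 versions
>>     Result result = table.get(get);
>>     // One Cell per stored version, each with its own timestamp.
>>     List<Cell> cells =
>>         result.getColumnCells(Bytes.toBytes("cf"), Bytes.toBytes("col"));
>>     for (Cell cell : cells) {
>>       System.out.println(cell.getTimestamp() + " -> "
>>           + Bytes.toString(CellUtil.cloneValue(cell)));
>>     }
>>     table.close();
>>   }
>> }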
>>
>> Jerry
>>
>>
>> On Thu, May 7, 2015 at 6:18 PM, java8964 <ja...@hotmail.com> wrote:
>>
>> > Hi,
>> > I am kind of new to HBase. Currently our production run IBM BigInsight
>> V3,
>> > comes with Hadoop 2.2 and HBase 0.96.0.
>> > We are mostly using HDFS and Hive/Pig for our BigData project, it works
>> > very good for our big datasets. Right now, we have a one dataset needs
>> to
>> > be loaded from Mysql, about 100G, and will have about Gs change daily.
>> This
>> > is a very important slow change dimension data, we like to sync between
>> > Mysql and BigData platform.
>> > I am thinking of using HBase to store it, instead of refreshing the
>> whole
>> > dataset in HDFS, due to:
>> > 1) HBase makes the merge the change very easy.2) HBase could store all
>> the
>> > changes in the history, as a function out of box. We will replicate all
>> the
>> > changes from the binlog level from Mysql, and we could keep all changes
>> in
>> > HBase (or long history), then it can give us some insight that cannot be
>> > done easily in HDFS.3) HBase could give us the benefit to access the
>> data
>> > by key fast, for some cases.4) HBase is available out of box.
>> > What I am not sure is the Hive/HBase integration. Hive is the top tool
>> in
>> > our environment. If one dataset stored in Hbase (even only about 100G as
>> > now), the join between it with the other Big datasets in HDFS worries
>> me. I
>> > read quite some information about Hive/HBase integration, and feel that
>> it
>> > is not really mature, as not too many usage cases I can find online,
>> > especially on performance. There are quite some JIRAs related to make
>> Hive
>> > utilize the HBase for performance in MR job are still pending.
>> > I want to know other people experience to use HBase in this way. I
>> > understand HBase is not designed as a storage system for Data Warehouse
>> > component or analytics engine. But the benefits to use HBase in this
>> case
>> > still attractive me. If my use cases of HBase is mostly read or full
>> scan
>> > the data, how bad it is compared to HDFS in the same cluster? 3x? 5x?
>> > To help me understand the read throughput of HBase, I use the HBase
>> > performance evaluation tool, but the output is quite confusing. I have 2
>> > clusters, one is with 5 nodes with 3 slaves all running on VM (Each with
>> > 24G + 4 cores, so cluster has 12 mappers + 6 reducers), another is real
>> > cluster with 5 nodes with 3 slaves with 64G + 24 cores and with (48
>> mapper
>> > slots + 24 reducer slots).Below is the result I run the "sequentialRead
>> 3"
>> > on the better cluster:
>> > 15/05/07 17:26:50 INFO mapred.JobClient: Counters: 30
>> >   File System Counters
>> >     FILE: BYTES_READ=546
>> >     FILE: BYTES_WRITTEN=7425074
>> >     HDFS: BYTES_READ=2700
>> >     HDFS: BYTES_WRITTEN=405
>> >   org.apache.hadoop.mapreduce.JobCounter
>> >     TOTAL_LAUNCHED_MAPS=30
>> >     TOTAL_LAUNCHED_REDUCES=1
>> >     SLOTS_MILLIS_MAPS=2905167
>> >     SLOTS_MILLIS_REDUCES=11340
>> >     FALLOW_SLOTS_MILLIS_MAPS=0
>> >     FALLOW_SLOTS_MILLIS_REDUCES=0
>> >   org.apache.hadoop.mapreduce.TaskCounter
>> >     MAP_INPUT_RECORDS=30
>> >     MAP_OUTPUT_RECORDS=30
>> >     MAP_OUTPUT_BYTES=480
>> >     MAP_OUTPUT_MATERIALIZED_BYTES=720
>> >     SPLIT_RAW_BYTES=2700
>> >     COMBINE_INPUT_RECORDS=0
>> >     COMBINE_OUTPUT_RECORDS=0
>> >     REDUCE_INPUT_GROUPS=30
>> >     REDUCE_SHUFFLE_BYTES=720
>> >     REDUCE_INPUT_RECORDS=30
>> >     REDUCE_OUTPUT_RECORDS=30
>> >     SPILLED_RECORDS=60
>> >     CPU_MILLISECONDS=1631450
>> >     PHYSICAL_MEMORY_BYTES=14031888384
>> >     VIRTUAL_MEMORY_BYTES=64139960320
>> >     COMMITTED_HEAP_BYTES=33822867456
>> >   HBase Performance Evaluation
>> >     Elapsed time in milliseconds=2489217
>> >     Row count=3145710
>> >   File Input Format Counters
>> >     Bytes Read=0
>> >   org.apache.hadoop.mapreduce.lib.output.FileOutputFormat$Counter
>> >     BYTES_WRITTEN=405
>> > First, what throughput should I read from the above result? Does it
>> > mean it took 2,489 seconds to sequentially read 3.1G of data (I assume
>> > every record is 1k)? That is about 1.2MB/s, which is very low compared
>> > to HDFS. Here is the output of the scan operation on the same cluster:
>> > 15/05/07 17:32:46 INFO mapred.JobClient:   HBase Performance Evaluation
>> >     Elapsed time in milliseconds=383021
>> >     Row count=3145710
>> > Does it mean scanning 3.1G of data can be done in 383 seconds on this
>> > cluster? What is the difference between scan and sequential read?
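>> > If I work out the implied throughput myself (assuming 1KB per row):
>> > 3,145,710 rows x 1KB is roughly 3GB, so the scan at 383 seconds comes
>> > to roughly 8MB/s, compared to about 1.2MB/s for sequentialRead at
>> > 2,489 seconds.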
>> > Of course, all these tests were done with the default settings HBase
>> > ships with on BigInsights. I am trying to learn how to tune it. What I
>> > am interested to know is: for a cluster of N nodes, what read
>> > throughput can I reasonably expect?
>> > Thanks for your time.
>> > Yong
>>
>
>


-- 
Swarnim