Posted to user@hbase.apache.org by Shushant Arora <sh...@gmail.com> on 2014/04/30 10:34:44 UTC

when to use hive vs hbase

I have a requirement to process huge weblogs on a daily basis.

1. Data will arrive incrementally in the datastore on a daily basis. I need
cumulative and daily distinct user counts from the logs; the aggregated data
will then be loaded into an RDBMS such as MySQL.

2. Data will be loaded into an HDFS data warehouse on a daily basis. The same
data will be fetched from the HDFS warehouse after some filtering, loaded
into an RDBMS such as MySQL, and processed there.

Which data warehouse is suitable for approaches 1 and 2, and why?

Thanks
Shushant

Re: when to use hive vs hbase

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.

Hi Shushant,

Have you looked at OpenTSDB? If you put the timestamp at the start of your
row key you will create what we call hotspots, and you want to avoid that.
OpenTSDB might help you with that.

The key you propose will create a hotspot with a default HBase setup, because
all new rows sort next to each other and land on the same region. You can
place the ID first, but then you can no longer really scan by date. You can
salt the key by prefixing a value between 0 and 9, but then you will need to
run 10 scans (one per salt value) instead of one. So take a quick look at
OpenTSDB (it uses HBase) and see if it helps your use case.
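
A minimal sketch of the salting idea, in plain Python with no HBase client
involved. The MD5-based bucket choice, the function names, and the `_`
separator are illustrative assumptions, not something taken from HBase or
OpenTSDB:

```python
import hashlib
from typing import List, Tuple

NUM_BUCKETS = 10  # salt values 0..9, as suggested in the message


def salt_for(request_id: str) -> int:
    """Derive a stable bucket from the request id (hypothetical hash choice)."""
    digest = hashlib.md5(request_id.encode("utf-8")).digest()
    return digest[0] % NUM_BUCKETS


def salted_row_key(timestamp: str, request_id: str) -> str:
    """Build '<salt>_<timestamp>_<request id>' so writes spread over buckets."""
    return f"{salt_for(request_id)}_{timestamp}_{request_id}"


def scan_ranges(start_ts: str, stop_ts: str) -> List[Tuple[str, str]]:
    """Return one (start_row, stop_row) pair per bucket; stop_row is exclusive."""
    return [(f"{b}_{start_ts}", f"{b}_{stop_ts}") for b in range(NUM_BUCKETS)]
```

With this layout the daily fetch becomes ten range scans, for example
`scan_ranges("2014-04-29", "2014-04-30")`, whose results are merged on the
client side. That is the extra read cost the salt buys write distribution with.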

JM


2014-04-30 8:39 GMT-04:00 Shushant Arora <sh...@gmail.com>:

> Thanks Jean!
>
> A few more questions. What are good practices for row key design in HBase?
> Say my web logs contain a timestamp and a request ID, which together
> uniquely identify each row.
>
> 1. Shall I use YYYY-MM-DD-HH-MM-SS_REQ_ID as the row key? In this scenario
> the data will be fetched from HBase on a daily basis and loaded into a
> MySQL DB. My ETL runs daily and will fetch records with keycol>=lastdate
> and keycol<=today. Will this key design overload one region server, or will
> the data be divided equally among region servers?

Re: when to use hive vs hbase

Posted by Shushant Arora <sh...@gmail.com>.

Thanks Jean!

A few more questions. What are good practices for row key design in HBase?
Say my web logs contain a timestamp and a request ID, which together uniquely
identify each row.

1. Shall I use YYYY-MM-DD-HH-MM-SS_REQ_ID as the row key? In this scenario the
data will be fetched from HBase on a daily basis and loaded into a MySQL DB.
My ETL runs daily and will fetch records with keycol>=lastdate and
keycol<=today. Will this key design overload one region server, or will the
data be divided equally among region servers?






On Wed, Apr 30, 2014 at 5:55 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> With HBase you have some overhead. The Region Server does a lot for you: it
> manages all the column families, the columns, the delete markers, the
> compactions, etc. If you read a file directly from HDFS it will be faster
> for sure, because you will not have all those validations and all this
> extra memory usage.
>
> HBase is excellent at what it's built for. But if you are doing only full
> table scans, that's not its primary use case. It can still do it if you
> want, but if you do only that, it's not the most efficient option.
>
> If your use case is a mix of full scans and random reads/random writes,
> then yes, go with it!
>
> Last, some full table scans can be a good fit for HBase if you use some of
> its specific features, like a TTL on certain column families when using
> more than one, etc.
>
> HTH

Re: when to use hive vs hbase

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.

With HBase you have some overhead. The Region Server does a lot for you: it
manages all the column families, the columns, the delete markers, the
compactions, etc. If you read a file directly from HDFS it will be faster
for sure, because you will not have all those validations and all this extra
memory usage.

HBase is excellent at what it's built for. But if you are doing only full
table scans, that's not its primary use case. It can still do it if you
want, but if you do only that, it's not the most efficient option.

If your use case is a mix of full scans and random reads/random writes, then
yes, go with it!

Last, some full table scans can be a good fit for HBase if you use some of
its specific features, like a TTL on certain column families when using more
than one, etc.

HTH


2014-04-30 8:13 GMT-04:00 Shushant Arora <sh...@gmail.com>:

> Hi Jean,
>
> Thanks for the explanation.
>
> I still have one doubt: why is HBase not good for bulk loads and
> aggregations (full table scans)? Hive will also read each row for an
> aggregation, just as HBase does.
> Can you explain more?

Re: when to use hive vs hbase

Posted by Shushant Arora <sh...@gmail.com>.

Hi Jean,

Thanks for the explanation.

I still have one doubt: why is HBase not good for bulk loads and aggregations
(full table scans)? Hive will also read each row for an aggregation, just as
HBase does.
Can you explain more?


On Wed, Apr 30, 2014 at 5:15 PM, Jean-Marc Spaggiari <
jean-marc@spaggiari.org> wrote:

> Hi Shushant,
>
> Hive and HBase are two different things. You cannot really use one in place
> of the other.
>
> Hive is a query engine over HDFS data. The data can be stored in different
> formats such as flat text, sequence files, Parquet files, or even HBase
> tables. HBase is both a query engine (Gets and Scans) and a storage engine
> on top of HDFS, which allows you to store data for random reads and random
> writes.
>
> You can also add tools like Phoenix and Impala to the picture, which
> will allow you to query the data from HDFS or HBase too.
>
> A good way to know whether HBase is a good fit is to ask yourself how you
> are going to write into HBase and read from HBase. HBase is good for
> random reads and random writes. If you only do bulk loads and aggregations
> (full table scans), HBase is not a good fit. If you do random access (client
> information, event details, etc.), HBase is a good fit.
>
> It's a bit oversimplified, but that should give you some starting points.

Re: when to use hive vs hbase

Posted by Jean-Marc Spaggiari <je...@spaggiari.org>.

Hi Shushant,

Hive and HBase are two different things. You cannot really use one in place
of the other.

Hive is a query engine over HDFS data. The data can be stored in different
formats such as flat text, sequence files, Parquet files, or even HBase
tables. HBase is both a query engine (Gets and Scans) and a storage engine on
top of HDFS, which allows you to store data for random reads and random
writes.

You can also add tools like Phoenix and Impala to the picture, which
will allow you to query the data from HDFS or HBase too.

A good way to know whether HBase is a good fit is to ask yourself how you
are going to write into HBase and read from HBase. HBase is good for
random reads and random writes. If you only do bulk loads and aggregations
(full table scans), HBase is not a good fit. If you do random access (client
information, event details, etc.), HBase is a good fit.

It's a bit oversimplified, but that should give you some starting points.
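
To make the "full table scan" case concrete, here is a rough sketch of the
aggregation from the original question (daily and cumulative distinct users).
It is plain Python over an in-memory list, not Hive or HBase code, and the
`(day, user_id)` record layout and function name are assumptions for
illustration. The point is that every record is read once and nothing is
looked up by key:

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def distinct_user_counts(
    records: Iterable[Tuple[str, str]]
) -> List[Tuple[str, int, int]]:
    """Return (day, daily_distinct, cumulative_distinct) rows ordered by day.

    `records` yields (day, user_id) pairs; each record is read exactly once.
    """
    users_by_day: Dict[str, Set[str]] = defaultdict(set)
    for day, user_id in records:
        users_by_day[day].add(user_id)

    seen: Set[str] = set()
    rows = []
    for day in sorted(users_by_day):
        seen |= users_by_day[day]  # users seen on this day or any earlier day
        rows.append((day, len(users_by_day[day]), len(seen)))
    return rows
```

A workload shaped like this gains little from random-access storage, which is
why reading the files straight from HDFS tends to suit it better.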


2014-04-30 4:34 GMT-04:00 Shushant Arora <sh...@gmail.com>:

> I have a requirement to process huge weblogs on a daily basis.
>
> 1. Data will arrive incrementally in the datastore on a daily basis. I need
> cumulative and daily distinct user counts from the logs; the aggregated data
> will then be loaded into an RDBMS such as MySQL.
>
> 2. Data will be loaded into an HDFS data warehouse on a daily basis. The
> same data will be fetched from the HDFS warehouse after some filtering,
> loaded into an RDBMS such as MySQL, and processed there.
>
> Which data warehouse is suitable for approaches 1 and 2, and why?
>
> Thanks
> Shushant
>

Re: when to use hive vs hbase

Posted by Shahab Yunus <sh...@gmail.com>.

Hive and HBase are two different tools/technologies. They are used together,
but they are not interchangeable.

Hive provides on-demand, RDBMS-style, SQL-like data access, while HBase is
the actual data store. Hive can run on top of HBase, providing an on-demand,
SQL-like API.

Regards,
Shahab

On Wed, Apr 30, 2014 at 4:34 AM, Shushant Arora
<sh...@gmail.com> wrote:

> I have a requirement to process huge weblogs on a daily basis.
>
> 1. Data will arrive incrementally in the datastore on a daily basis. I need
> cumulative and daily distinct user counts from the logs; the aggregated data
> will then be loaded into an RDBMS such as MySQL.
>
> 2. Data will be loaded into an HDFS data warehouse on a daily basis. The
> same data will be fetched from the HDFS warehouse after some filtering,
> loaded into an RDBMS such as MySQL, and processed there.
>
> Which data warehouse is suitable for approaches 1 and 2, and why?
>
> Thanks
> Shushant
>

Re: when to use hive vs hbase

Posted by Shushant Arora <sh...@gmail.com>.

Would it be better to map an existing HBase table to Hive, or to create Hive
tables directly?

To reiterate the two scenarios:

I have a requirement to process huge weblogs on a daily basis.

Scenario 1. Data will arrive incrementally in the datastore (containing
timestamp, userid, operation performed) on a daily basis. I need cumulative
and daily distinct user counts from the logs; the aggregated data will then
be loaded into an RDBMS such as MySQL.

Scenario 2. Data will be loaded into an HDFS data warehouse on a daily basis
from the weblogs directory. The same data will be fetched from the HDFS
warehouse after some filtering (the fetch criterion will be the date), loaded
into an RDBMS such as MySQL, and processed in MySQL.

Which data warehouse, Hive or HBase, is suitable for approaches 1 and 2, and
why?

Thanks

On Wed, Apr 30, 2014 at 4:01 PM, unmesha sreeveni <un...@gmail.com> wrote:

> HDFS lacks random read and write access. This is where HBase comes into the
> picture. It stores data as key/value pairs.
> Hive provides data warehousing facilities on top of an existing Hadoop
> cluster. It provides an SQL-like interface, which makes your work easier.
> You can create tables in Hive and store data there. You can also map your
> existing HBase tables to Hive and operate on them.

Re: when to use hive vs hbase

Posted by unmesha sreeveni <un...@gmail.com>.

HDFS lacks random read and write access. This is where HBase comes into the
picture. It stores data as key/value pairs.
Hive provides data warehousing facilities on top of an existing Hadoop
cluster. It provides an SQL-like interface, which makes your work easier.
You can create tables in Hive and store data there. You can also map your
existing HBase tables to Hive and operate on them.
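
As a rough illustration of that mapping, the sketch below builds the kind of
Hive DDL that exposes an existing HBase table through Hive's HBase storage
handler. The helper name, the table names, and the `d` column family are
hypothetical; only the `STORED BY`, `hbase.columns.mapping`, and
`hbase.table.name` pieces come from Hive's HBase integration, and the
generated statement should be checked against your Hive version before use:

```python
from typing import Dict


def hive_ddl_for_hbase_table(
    hive_table: str, hbase_table: str, columns: Dict[str, str]
) -> str:
    """Build a CREATE EXTERNAL TABLE statement mapping an HBase table into Hive.

    `columns` maps Hive column names to HBase 'family:qualifier' names. The
    HBase row key is exposed as a string column named `rowkey`.
    """
    hive_cols = ", ".join(["rowkey STRING"] + [f"{name} STRING" for name in columns])
    mapping = ",".join([":key"] + list(columns.values()))
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({hive_cols})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')"
    )
```

For example, `hive_ddl_for_hbase_table("weblogs_hive", "weblogs", {"userid":
"d:userid", "operation": "d:operation"})` yields a statement that could be
pasted into the Hive shell, after which the HBase data can be queried with
ordinary HiveQL.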

On Wed, Apr 30, 2014 at 2:04 PM, Shushant Arora
<sh...@gmail.com> wrote:

> I have a requirement to process huge weblogs on a daily basis.
>
> 1. Data will arrive incrementally in the datastore on a daily basis. I need
> cumulative and daily distinct user counts from the logs; the aggregated data
> will then be loaded into an RDBMS such as MySQL.
>
> 2. Data will be loaded into an HDFS data warehouse on a daily basis. The
> same data will be fetched from the HDFS warehouse after some filtering,
> loaded into an RDBMS such as MySQL, and processed there.
>
> Which data warehouse is suitable for approaches 1 and 2, and why?
>
> Thanks
> Shushant
>
>

-- 
*Thanks & Regards *

*Unmesha Sreeveni U.B*
*Hadoop, Bigdata Developer*
*Center for Cyber Security | Amrita Vishwa Vidyapeetham*
http://www.unmeshasreeveni.blogspot.in/