Posted to user@hbase.apache.org by Max Grigoriev <da...@gmail.com> on 2008/04/28 23:57:36 UTC

Is HBase suitable for ...

Hi there,

I'm doing research to find the right solution for our needs.
We need a persistence layer for the groups of a social network.
These groups will have a large amount of data (~100 GB): user profiles, their
activities, etc.
All work with these entities must be done online: a user can ask to
unsubscribe, or to connect other users to him.
So we'll work with small pieces of a big dataset online, not with big data
offline like a log parser.
We want to be able to search on different table attributes and, of
course, we need scalability and failover.
We need to add and remove nodes in the cluster easily, without stopping the
entire system.

All of this can be done with Amazon SimpleDB, but we don't want to depend on
an external service. That's why we're looking for a third-party product.

Our candidates are:

   - HBase
   - CouchDB
   - Hypertable
   - Our own homegrown solution

Can you tell me whether HBase will work for such a system?
If we have 2 or 3 data centers and we lose the connection between them, what
behavior will we see from HBase?
And when we restore the connection 1-2 hours later, what should we expect
from HBase?


Thank you.

Re: Is HBase suitable for ...

Posted by Bryan Duxbury <br...@rapleaf.com>.

My replies and questions inline.

On Apr 28, 2008, at 2:57 PM, Max Grigoriev wrote:

> Hi there,
>
> I'm doing research to find the right solution for our needs.
> We need a persistence layer for the groups of a social network.
> These groups will have a large amount of data (~100 GB): user profiles,
> their activities, etc.
100GB per group, or 100GB overall? How many groups?

> All work with these entities must be done online: a user can ask to
> unsubscribe, or to connect other users to him.
> So we'll work with small pieces of a big dataset online, not with big
> data offline like a log parser.
> We want to be able to search on different table attributes and, of
> course, we need scalability and failover.
What kind of search on different table attributes do you want to do?  
There are no general purpose secondary indexes in HBase, so you  
either have to do a full- or partial-table scan or put the search  
attribute in the primary key.
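
To illustrate the second option, here is a minimal Python sketch. It is an
in-memory toy, not HBase client code. The only property it relies on is that
rows are stored sorted by row key, so leading the key with the search
attribute turns the search into a partial-table scan. The SortedTable class,
the "city/user" key layout, and the sample rows are hypothetical.

```python
import bisect


class SortedTable:
    """Toy stand-in for a table whose rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []   # row keys in sorted order
        self._rows = {}   # row key -> row data

    def put(self, key, row):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = row

    def scan_prefix(self, prefix):
        """Partial-table scan: visit only rows whose key starts with prefix."""
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # keys are sorted, so no later key can match
            yield key, self._rows[key]


table = SortedTable()
table.put("berlin/anna", {"age": 31})
table.put("moscow/max", {"age": 27})
table.put("moscow/olga", {"age": 35})
table.put("paris/luc", {"age": 40})

# The search attribute (city) leads the key, so only matching rows are read.
moscow_users = [key for key, _ in table.scan_prefix("moscow/")]
```

The trade-off is that only the attribute that leads the key can be searched
this way; any other attribute still needs a full scan or its own index.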

As far as failover, at the moment, HBase has good recovery for region  
servers, and no recovery for the master. That's something we're  
hoping to change in the future.

> We need to add and remove nodes in the cluster easily, without stopping
> the entire system.
You can do this, and it's not that hard.

>
> All of this can be done with Amazon SimpleDB, but we don't want to
> depend on an external service. That's why we're looking for a
> third-party product.
>
> Our candidates are:
>
>    - HBase
>    - CouchDB
>    - Hypertable
>    - Our own homegrown solution
>
> Can you tell me whether HBase will work for such a system?
I think HBase can do what you need, but it'd be nice to have more  
details about what exactly you're going to do with it.

> If we have 2 or 3 data centers and we lose the connection between them,
> what behavior will we see from HBase?
Is your intent to run a single HBase instance across several data  
centers? At the moment, if a regionserver is cut off from the master,  
it will kill itself. This means that if you have your master at one  
location and regionservers at another, and you lose connectivity,  
your regionservers at the other locations will shut themselves down.  
There are solutions to this we've discussed in the past. However, I  
wonder if maybe the correct solution is not to partition across data  
centers. It's not something that we've discussed at great length yet,  
so there might be an easier way to do it than I'm thinking.

> And when we restore the connection 1-2 hours later, what should we
> expect from HBase?
This is where things would get sticky - how do you resolve conflicts  
in how data is being served, or worse, how it was split into regions?  
It seems inherently complicated and unpleasant.

>
>
> Thank you.

Re: Is HBase suitable for ...

Posted by Max Grigoriev <da...@gmail.com>.

Understood.
Need to sit down and relax and find the right way :)

On 4/29/08, Jim Kellerman <ji...@powerset.com> wrote:
>
> see comments in line below:
>
>

RE: Is HBase suitable for ...

Posted by Jim Kellerman <ji...@powerset.com>.

see comments in line below:

---
Jim Kellerman, Senior Engineer; Powerset

> -----Original Message-----
> From: Max Grigoriev [mailto:darkit@gmail.com]
> Sent: Tuesday, April 29, 2008 3:51 AM
> To: hbase-user@hadoop.apache.org
> Subject: Re: Is HBase suitable for ...
>
> Replies and questions inline.
>
>
> >
> > On Apr 28, 2008, at 2:57 PM, Max Grigoriev wrote:
> >
> >
> > What kind of search on different table attributes do you want to do?
> > There are no general purpose secondary indexes in HBase, so you
> > either have to do a full- or partial-table scan or put the search
> > attribute in the primary key.
> >
>
> The system is the core of several different social networks, so it
> should be able to search on every attribute, because during core
> development you don't know all the entities or all the search queries.
> So I'm thinking of using a Hibernate-style mapping (no relations such
> as many-to-one, just single attributes) in which the user can describe
> an entity and whether it is indexed. In that case the system will
> create a secondary index.
> Since HBase doesn't support secondary indexes, I think I can emulate
> them by creating them by hand as secondary index -> primary index
> tables, as is done in Berkeley DB, for example.
>

Currently, this will result in a random read for the other record, which
HBase does not do well. See
http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation for more
information.
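
To make that cost concrete, here is a minimal Python sketch of a
hand-maintained secondary index. It is an in-memory toy, not HBase client
code: IndexedStore, its fields, and the sample data are hypothetical. Each
index hit only yields a primary key, so every matching record costs one
further read of the primary table.

```python
class IndexedStore:
    """Toy store with one hand-maintained secondary index (not HBase API)."""

    def __init__(self, indexed_attr):
        self.indexed_attr = indexed_attr
        self.primary = {}      # primary key -> record
        self.index = {}        # attribute value -> set of primary keys
        self.random_reads = 0  # reads of the primary table caused by lookups

    def put(self, key, record):
        old = self.primary.get(key)
        if old is not None:
            # Remove the stale index entry before writing the new one.
            old_value = old.get(self.indexed_attr)
            self.index.get(old_value, set()).discard(key)
        self.primary[key] = record
        value = record.get(self.indexed_attr)
        self.index.setdefault(value, set()).add(key)

    def find_by_attr(self, value):
        results = []
        for key in sorted(self.index.get(value, ())):
            self.random_reads += 1  # one read of the primary table per hit
            results.append(self.primary[key])
        return results


store = IndexedStore("city")
store.put("u1", {"name": "Max", "city": "Moscow"})
store.put("u2", {"name": "Olga", "city": "Moscow"})
store.put("u1", {"name": "Max", "city": "Berlin"})  # moves u1 in the index

found = store.find_by_attr("Moscow")
```

Note that every write now touches two tables, and in a real system those two
writes would not be atomic unless the application arranges for that itself.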

> > As far as failover, at the moment, HBase has good recovery for region
> > servers, and no recovery for the master. That's something we're
> > hoping to change in the future.
> >
>
> Is that future near or far? Can I create a new master if the initial
> master fails? Can the master have slaves?

It will be at least a couple of months before we get around to doing
master failover.

You can create a new master in case of master failure, but you will
have to restart HBase in order for the regionservers to find it.

Masters cannot have slaves currently.

> > > Can you tell me whether HBase will work for such a system?
> > I think HBase can do what you need, but it'd be nice to have more
> > details about what exactly you're going to do with it.
> >
> I don't know :) because the application developers will decide what the
> entities are and what they do. What I have to do is create an
> environment that makes it easy to build applications.
>
>
> > > If we have 2 or 3 data centers and we lose the connection between
> > > them, what behavior will we see from HBase?
> > Is your intent to run a single HBase instance across several data
> > centers?
> >
>
> Yes, because you don't know which data center might go down.

Neither HBase nor Bigtable is designed to span data centers because
of latency and network partitioning issues. What Google does with Bigtable
is run a Bigtable cluster with the same data in each data center and
then stream updates between clusters. What you get is a sort of eventual
consistency, but there are no guarantees about simultaneous (or nearly
simultaneous) updates to the same row. (Nearly simultaneous updates
are those that occur within the replication latency window.)

HBase will do similar replication in the future, but that will come
after master failover support.

It should also be noted that you would have to run Hadoop DFS split
across data centers in order to run HBase split across data centers.
Hadoop does not support this mode of operation either. (Nor does
Google's GFS)

> > > And when we restore the connection 1-2 hours later, what should we
> > > expect from HBase?
> > This is where things would get sticky - how do you resolve conflicts
> > in how data is being served, or worse, how it was split into regions?
> > It seems inherently complicated and unpleasant.
> >
> You can update all the records of the restored node by update timestamp.

As noted above, splitting a cluster across data centers just will not work.

When replication is implemented, and connection is restored, the clusters
will stream updates to each other that occurred during the network
partitioning and will eventually reach a consistent state.
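
As a rough illustration of how two clusters could converge once updates
flow again, here is a minimal Python sketch of a "newest timestamp wins"
merge. Replication is not implemented yet, so this is not HBase's design:
merge_updates, the (row, column, timestamp, value) tuple layout, and the
sample updates are all hypothetical.

```python
def merge_updates(*update_logs):
    """Replay updates from all clusters; the newest timestamp wins per cell."""
    state = {}  # (row, column) -> (timestamp, value)
    for log in update_logs:
        for row, column, timestamp, value in log:
            cell = (row, column)
            if cell not in state or timestamp > state[cell][0]:
                state[cell] = (timestamp, value)
    return state


# Updates accepted by each cluster while the link between them was down.
cluster_a = [("user1", "status", 100, "subscribed"),
             ("user2", "friend", 105, "user1")]
cluster_b = [("user1", "status", 130, "unsubscribed")]

# Each side applies the union of updates. With distinct timestamps the
# result does not depend on the order in which the logs arrive.
state_on_a = merge_updates(cluster_a, cluster_b)
state_on_b = merge_updates(cluster_b, cluster_a)
```

Two updates to the same cell with the same timestamp would need an extra
tiebreaker, which is one reason there are no guarantees for nearly
simultaneous updates to the same row.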


Re: Is HBase suitable for ...

Posted by Max Grigoriev <da...@gmail.com>.

Replies and questions inline.

>
> On Apr 28, 2008, at 2:57 PM, Max Grigoriev wrote:
>
>
> What kind of search on different table attributes do you want to do?
> There are no general purpose secondary indexes in HBase, so you
> either have to do a full- or partial-table scan or put the search
> attribute in the primary key.
>

The system is the core of several different social networks, so it should be
able to search on every attribute, because during core development you don't
know all the entities or all the search queries. So I'm thinking of using a
Hibernate-style mapping (no relations such as many-to-one, just single
attributes) in which the user can describe an entity and whether it is
indexed. In that case the system will create a secondary index.
Since HBase doesn't support secondary indexes, I think I can emulate them by
creating them by hand as secondary index -> primary index tables, as is
done in Berkeley DB, for example.

> As far as failover, at the moment, HBase has good recovery for region
> servers, and no recovery for the master. That's something we're
> hoping to change in the future.
>

Is that future near or far? Can I create a new master if the initial
master fails? Can the master have slaves?

> > Can you tell me whether HBase will work for such a system?
> I think HBase can do what you need, but it'd be nice to have more
> details about what exactly you're going to do with it.
>
I don't know :) because the application developers will decide what the
entities are and what they do. What I have to do is create an environment
that makes it easy to build applications.

> > If we have 2 or 3 data centers and we lose the connection between
> > them, what behavior will we see from HBase?
> Is your intent to run a single HBase instance across several data
> centers?
>

Yes, because you don't know which data center might go down.

> At the moment, if a regionserver is cut off from the master,
> it will kill itself. This means that if you have your master at one
> location and regionservers at another, and you lose connectivity,
> your regionservers at the other locations will shut themselves down.
> There are solutions to this we've discussed in the past. However, I
> wonder if maybe the correct solution is not to partition across data
> centers. It's not something that we've discussed at great length yet,
> so there might be an easier way to do it than I'm thinking.
>

If one data center goes down and it holds unique data, you can't
continue to work. That's bad. So it's better to have the data in both data
centers, so that if one of them is dead you can continue to work.

> > And when we restore the connection 1-2 hours later, what should we
> > expect from HBase?
> This is where things would get sticky - how do you resolve conflicts
> in how data is being served, or worse, how it was split into regions?
> It seems inherently complicated and unpleasant.
>

You can update all the records of the restored node by update timestamp.