Posted to dev@hbase.apache.org by er...@yahoo.com on 2011/01/17 13:22:05 UTC

evaluating HBase

Hi,

I am currently evaluating HBase for an implementation of an ERP-like cloud 
solution that's supposed to handle 500M lines per year for the biggest tenant 
and 10-20M for the smaller tenants.  I am writing a couple of prototypes, one using 
MySQL (sharded) and one with HBase - I will let you know what I find if you are 
interested.  Anyway, I have 2 questions:

The first one is regarding the following post and I would like to get a 
perspective from the no-sql camp on this one.
http://www.quora.com/Why-does-Quora-use-MySQL-as-the-data-store-rather-than-NoSQLs-such-as-Cassandra-MongoDB-CouchDB-etc


The second is regarding how to best implement a 'duplicate check' validation. 
Here is what I have done so far: I have a single entity table, and I have 
created an index table whose key is the concatenated value of 4 of the 
entity's attributes (these 4 attributes define what constitutes a duplicate 
record, while the entity can have around 100-150 different attributes).  In 
this index table, I have a column in which I store a comma-delimited list of 
all the keys that correspond to entities sharing the same set of 4 attribute 
values.

For example (assuming that a dup is defined by entities having the same 
values for a and b):

EntityTable:
key, a, b, c, d, e
1, 1, 1, 1, 1, 1
2, 1, 1, 2, 2, 2
3, 1, 2, 2, 2, 2
4, 2, 2, 2, 2, 2

IndexTable:
key, value
11, [1, 2]
12, [3]
22, [4]

When I scan through my entity table, I plan to look up the index table by the 
dup key and add the current entity key to it.  I am worried about the cost of 
doing this lookup for every entity record.  To make things more complicated, 
the set of attributes that defines a dup can change; I handle that by 
recreating my index table.
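To make the scheme concrete, here is a minimal in-memory sketch of the lookup-and-add step, using plain Java collections in place of the two HBase tables (the method names and the separator choice are my own illustration, not anything from HBase):

```java
import java.util.*;

public class DupIndexSketch {
    // Stands in for IndexTable: dup key -> list of entity keys sharing it.
    static final Map<String, List<String>> index = new HashMap<>();

    // Build the dup key by concatenating the attributes that define a
    // duplicate. A separator avoids ambiguous keys: without it, the
    // attribute pairs ("1","11") and ("11","1") would both yield "111".
    static String dupKey(String a, String b) {
        return a + "|" + b;
    }

    // Returns the entity keys that are duplicates of this record,
    // then registers the record itself in the index.
    static List<String> checkAndAdd(String entityKey, String a, String b) {
        String key = dupKey(a, b);
        List<String> bucket = index.computeIfAbsent(key, k -> new ArrayList<>());
        List<String> existingDups = new ArrayList<>(bucket);
        bucket.add(entityKey);
        return existingDups;
    }

    public static void main(String[] args) {
        System.out.println(checkAndAdd("1", "1", "1")); // []
        System.out.println(checkAndAdd("2", "1", "1")); // [1]
        System.out.println(checkAndAdd("3", "1", "2")); // []
        System.out.println(checkAndAdd("4", "2", "2")); // []
    }
}
```

Against an actual HBase IndexTable this per-record read-modify-write is the part to worry about; a Get plus a Put per scanned row roughly doubles the round trips.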

Is there a better way to write a dup check?

Thanks a lot for your help,
-Eric

Re: evaluating HBase

Posted by Steven Noels <st...@outerthought.org>.
On Mon, Jan 17, 2011 at 2:53 PM, Thomas Koch <th...@koch.ro> wrote:

> eric_bdr@yahoo.com:
> > Hi,
> >
> > I am currently evaluating HBase for an implementation of an ERP-like cloud
> > solution that's supposed to handle 500M lines per year for the biggest
> > tenant and 10-20M for the smaller tenants.
>
> Hi Eric,
>
> have a look at lily:
> http://www.lilyproject.org
>
> Lily should become a scalable content repository for CMS'. But there's no
> reason why you couldn't use it for ERP solutions.

FWIW, we're currently doing a POC to validate the use of Lily as a product
(and more) catalogue underneath an e-commerce/retail platform.

Steven.
-- 
Steven Noels
http://outerthought.org/
Open Source Content Applications
Makers of Kauri, Daisy CMS and Lily

Re: evaluating HBase

Posted by Thomas Koch <th...@koch.ro>.
eric_bdr@yahoo.com:
> Hi,
> 
> I am currently evaluating HBase for an implementation of an ERP-like cloud
> solution that's supposed to handle 500M lines per year for the biggest
> tenant and 10-20m for the smaller tenants.  
Hi Eric,

have a look at lily:
http://www.lilyproject.org

Lily is intended to become a scalable content repository for CMSes. But there's no 
reason why you couldn't use it for ERP solutions.
Keep us posted!

Best regards,

Thomas Koch, http://www.koch.ro

RE: evaluating HBase

Posted by "Abinash Karana (Bizosys)" <ab...@bizosys.com>.
Hi Eric,
The duplicate record problem is addressed in Nutch by computing a signature
for each document.

This signature is then used to decide whether the information is duplicated or
not. Your design is also good.

However, there is one possible issue with it: after you read an index key and
its matching entity keys, fetching the details [a, b, c, d, e] for each entity
requires a random read, and that will be slow.
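A sketch of that signature idea, using MD5 from the Java standard library (the field choice and the zero-byte separator are illustrative, not Nutch's exact scheme):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class SignatureSketch {
    // Hash the concatenated dup-defining fields into a fixed-size
    // signature, which can serve directly as the index-table row key.
    static String signature(String... fields) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            for (String f : fields) {
                md.update(f.getBytes(StandardCharsets.UTF_8));
                md.update((byte) 0); // field separator to avoid ambiguity
            }
            // Render the 16-byte digest as a 32-char hex string.
            return String.format("%032x", new BigInteger(1, md.digest()));
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Records with equal dup-defining fields hash to the same signature.
        System.out.println(signature("1", "1").equals(signature("1", "1")));
        System.out.println(signature("1", "11").equals(signature("11", "1")));
    }
}
```

A fixed-size signature key also sidesteps unbounded row-key growth when the dup-defining attributes are long strings.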

MySQL (sharded) vs HBase - please do share your findings.

Cheers
Abinash Karan

-----Original Message-----
From: eric_bdr@yahoo.com [mailto:eric_bdr@yahoo.com] 
Sent: Monday, January 17, 2011 5:52 PM
To: dev@hbase.apache.org
Subject: evaluating HBase
