Posted to dev@hbase.apache.org by er...@yahoo.com on 2011/01/17 13:22:05 UTC
evaluating HBase
Hi,
I am currently evaluating HBase for an implementation of an ERP-like cloud
solution that's supposed to handle 500M lines per year for the biggest tenant
and 10-20M for the smaller tenants. I am writing a couple of prototypes, one using
MySQL (sharded) and one with HBase - I will let you know what I find if you are
interested. Anyway, I have two questions:
The first is about the following post; I would like to get the NoSQL camp's
perspective on it.
http://www.quora.com/Why-does-Quora-use-MySQL-as-the-data-store-rather-than-NoSQLs-such-as-Cassandra-MongoDB-CouchDB-etc
The second is about how best to implement a 'duplicate check' validation.
Here is what I have done so far: I have a single entity table, and I have
created an index table whose key is the concatenated value of the 4
attributes of the entity (these 4 attributes define what constitutes a
duplicate record, while the entity can have around 100-150 different
attributes). In this index table, I have a column in which I store a
comma-delimited list of all the keys that correspond to entities sharing
the same set of 4 attribute values.
For example (assuming that a dup is defined by entities having the same
values of a and b):
EntityTable:
key, a, b, c, d, e
1, 1, 1, 1, 1, 1
2, 1, 1, 2, 2, 2
3, 1, 2, 2, 2, 2
4, 2, 2, 2, 2, 2
IndexTable:
key, value
11, [1, 2]
12, [3]
22, [4]
When I scan through my entity table, I plan on looking up the index table by the
dup key and adding the current entity key to it. I am worried about the
performance of this lookup per entity record. To make things more complicated, I
should be able to change the set of attributes that define a dup. I handle that by
recreating my index table.
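For what it's worth, here is the plan above as a toy Python sketch, with plain
dicts standing in for the two HBase tables (illustrative only, not HBase client
code; all names are made up):

```python
# Toy model of the dup-check scheme: dicts stand in for the HBase tables.
# entity_table maps entity key -> attribute dict; the index table maps the
# concatenated dup-key attribute values -> list of entity keys sharing them.

DUP_ATTRS = ["a", "b"]  # the attributes that define a duplicate (configurable)

entity_table = {
    1: {"a": 1, "b": 1, "c": 1, "d": 1, "e": 1},
    2: {"a": 1, "b": 1, "c": 2, "d": 2, "e": 2},
    3: {"a": 1, "b": 2, "c": 2, "d": 2, "e": 2},
    4: {"a": 2, "b": 2, "c": 2, "d": 2, "e": 2},
}

def dup_key(entity, attrs):
    """Concatenate the dup-defining attribute values into an index key."""
    return "".join(str(entity[a]) for a in attrs)

def build_index(entities, attrs):
    """Full scan of the entity table, as when the dup definition changes."""
    index = {}
    for key, entity in entities.items():
        index.setdefault(dup_key(entity, attrs), []).append(key)
    return index

index_table = build_index(entity_table, DUP_ATTRS)
# index_table == {"11": [1, 2], "12": [3], "22": [4]}

def is_duplicate(entity, attrs, index):
    """The per-record lookup described above: one read of the index table."""
    return len(index.get(dup_key(entity, attrs), [])) > 0
```

Changing the dup definition is then just calling build_index again with a
different attribute list, which mirrors the "recreate the index table" approach.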
Is there a better way to write a dup check?
Thanks a lot for your help,
-Eric
Re: evaluating HBase
Posted by Steven Noels <st...@outerthought.org>.
On Mon, Jan 17, 2011 at 2:53 PM, Thomas Koch <th...@koch.ro> wrote:
> eric_bdr@yahoo.com:
> > Hi,
> >
> > I am currently evaluating HBase for an implementation of an ERP-like
> > cloud solution that's supposed to handle 500M lines per year for the
> > biggest tenant and 10-20m for the smaller tenants.
>
> Hi Eric,
>
> have a look at lily:
> http://www.lilyproject.org
>
> Lily should become a scalable content repository for CMSes. But there's no
> reason why you couldn't use it for ERP solutions.
>
FWIW, we're currently doing a POC to validate the use of Lily as a product
(and more) catalogue underneath an e-commerce/retail platform.
Steven.
--
Steven Noels
http://outerthought.org/
Open Source Content Applications
Makers of Kauri, Daisy CMS and Lily
Re: evaluating HBase
Posted by Thomas Koch <th...@koch.ro>.
eric_bdr@yahoo.com:
> Hi,
>
> I am currently evaluating HBase for an implementation of an ERP-like cloud
> solution that's supposed to handle 500M lines per year for the biggest
> tenant and 10-20m for the smaller tenants.
Hi Eric,
have a look at lily:
http://www.lilyproject.org
Lily should become a scalable content repository for CMSes. But there's no
reason why you couldn't use it for ERP solutions.
Keep us posted!
Best regards,
Thomas Koch, http://www.koch.ro
RE: evaluating HBase
Posted by "Abinash Karana (Bizosys)" <ab...@bizosys.com>.
Hi Eric,
Nutch addresses the duplicate record problem by computing a signature for
each record; comparing signatures tells it whether the information is
duplicated or not. Your design is also good.
However, there is one possible issue with it: after you read an index key
and its list of matching entity keys, fetching the details [a, b, c, d, e]
requires a random read per entity, and that will be slow.
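As a sketch of the signature idea (hedged: Nutch's actual Signature classes
hash fetched page content; here I just hash the dup-defining fields with MD5,
and the field names are from Eric's example):

```python
import hashlib

def signature(entity, attrs):
    """Hash the dup-defining attribute values into a fixed-size signature,
    in the spirit of Nutch's MD5-based signatures. The NUL separator avoids
    ambiguous concatenations such as ("1", "11") vs ("11", "1")."""
    joined = "\x00".join(str(entity[a]) for a in attrs)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

e1 = {"a": 1, "b": 1, "c": 1}
e2 = {"a": 1, "b": 1, "c": 2}
e3 = {"a": 1, "b": 2, "c": 2}

# e1 and e2 agree on the dup-defining attributes (a, b); e3 does not
assert signature(e1, ["a", "b"]) == signature(e2, ["a", "b"])
assert signature(e1, ["a", "b"]) != signature(e3, ["a", "b"])
```

A side benefit: using the signature as the index-table row key keeps the keys
fixed-width no matter how large the attribute values are.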
MySQL (sharded) vs HBase - please do share your findings.
Cheers
Abinash Karan
-----Original Message-----
From: eric_bdr@yahoo.com [mailto:eric_bdr@yahoo.com]
Sent: Monday, January 17, 2011 5:52 PM
To: dev@hbase.apache.org
Subject: evaluating HBase