Posted to user@hbase.apache.org by tim robertson <ti...@gmail.com> on 2009/06/28 16:43:21 UTC

HBase integration - DAO vs more "loosely defined" data access

Hi all,

I am curious how people are structuring their data access code when
using HBase, so I was hoping for insights from the community.
I'm one of those developers with lots of experience with relational
DBs, Spring, Hibernate etc., now exploring HBase after hitting
limits in MySQL (2 tables, each with 200 million rows).
This is *not* an RDBMS vs HBase question, but more related to how to
cleanly structure application code once HBase has been decided upon.

So far, I have done 2 small sample projects each differently:

- For the first one, I more or less copied the Spring JDBC DAO
approach: a POJO factory per column family, applied to each
RowResult when scanning.  So basically I abstracted a CRUD interface
and search methods that handled POJO objects, then did the usual
Spring wiring of the DAO into the application (I guess it just felt
normal to do that at the time ;)  A rough sketch of this is below,
after the second approach.

- For the second one, I had reasonably well defined terms in the
application (e.g. dwc:scientificName) and built a layer that used
properties files to map those terms to tables, families and
columns.  E.g. an insertHarvestedRecord(Map<String, String> data)
method might pick up a prop file mapping dwc:scientificName to the
"unparsed" family, but another method might map it to a different
column family altogether.  Additionally, I was loading in lots of
CSV data and was able to do CSV column to HBase family:column
mapping, which worked nicely (although I would now run the load
through MapReduce to distribute it).  This one is sketched below as
well.
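
To make that concrete, here is roughly the shape of the first
approach.  This is a minimal sketch from memory of the pre-0.20
client API (RowResult, Cell, Bytes), and the POJO, factory and
column names are made up for illustration, not my actual code:

  import java.io.IOException;
  import org.apache.hadoop.hbase.io.Cell;
  import org.apache.hadoop.hbase.io.RowResult;
  import org.apache.hadoop.hbase.util.Bytes;

  // Hypothetical POJO for the "unparsed" column family
  class UnparsedRecord {
    private String rowKey;
    private String scientificName;
    String getRowKey() { return rowKey; }
    void setRowKey(String k) { rowKey = k; }
    String getScientificName() { return scientificName; }
    void setScientificName(String n) { scientificName = n; }
  }

  // The CRUD-style interface the rest of the app codes against,
  // with no HBase types leaking out
  interface UnparsedRecordDao {
    UnparsedRecord get(String rowKey) throws IOException;
    void save(UnparsedRecord record) throws IOException;
  }

  // The "POJO factory": builds the POJO from each RowResult that
  // comes back from a scan
  class UnparsedRecordFactory {
    UnparsedRecord fromRowResult(RowResult row) {
      UnparsedRecord r = new UnparsedRecord();
      r.setRowKey(Bytes.toString(row.getRow()));
      Cell cell = row.get(Bytes.toBytes("unparsed:scientificName"));
      if (cell != null) {
        r.setScientificName(Bytes.toString(cell.getValue()));
      }
      return r;
    }
  }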
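
And a sketch of the second approach (again illustrative: the file
name harvested.properties and the MappedInserter class are invented,
and BatchUpdate/commit is the pre-0.20 write path as I remember it).
Note that a ':' in a properties key has to be escaped:

  import java.io.FileInputStream;
  import java.io.IOException;
  import java.util.Map;
  import java.util.Properties;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.io.BatchUpdate;
  import org.apache.hadoop.hbase.util.Bytes;

  // harvested.properties might contain:
  //   dwc\:scientificName=unparsed:scientificName
  public class MappedInserter {
    private final HTable table;
    private final Properties mapping = new Properties();

    public MappedInserter(HTable table, String mappingFile)
        throws IOException {
      this.table = table;
      this.mapping.load(new FileInputStream(mappingFile));
    }

    // Each well defined term is written to whatever family:column
    // the properties file maps it to; unmapped terms are skipped
    public void insertHarvestedRecord(String rowKey,
        Map<String, String> data) throws IOException {
      BatchUpdate update = new BatchUpdate(rowKey);
      for (Map.Entry<String, String> e : data.entrySet()) {
        String target = mapping.getProperty(e.getKey());
        if (target != null) {
          update.put(target, Bytes.toBytes(e.getValue()));
        }
      }
      table.commit(update);
    }
  }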

The first approach I found quite limiting, as changes meant a lot of
tedious coding and recompilation, but it did catch errors early.
One of the motivations for this approach was that I could also get
other developers to work on top of it with no knowledge of the data
store (possibly that was a con and not a pro anyway, as I expect a
lot of MapReduce operations on the data).

The second approach I found super flexible, but the effort went into
maintaining test cases to catch changes.  I ended up dealing with a
lot of List<Map<String, String>> situations, and the data store
definitely became more "embedded" in the application code itself.

Has anyone got any nice insights to share with those moving from the
typical Spring / Hibernate world?
Do you use the HBase API natively?
Enumerations for column defs?  Hardcoded Strings?
Did the ORM thing
(http://www.nabble.com/Hbase-ORM-any-one-interested--td19739869.html)
take off?
Maybe no one approach is the best fit anyway...

Cheers,

Tim

Re: HBase integration - DAO vs more "loosely defined" data access

Posted by stack <st...@duboce.net>.
On Sun, Jun 28, 2009 at 7:43 AM, tim robertson <ti...@gmail.com> wrote:

>
> Did the ORM thing
> (http://www.nabble.com/Hbase-ORM-any-one-interested--td19739869.html)
> take off?
>

Hey Tim:  Above ORM thing became:
http://belowdeck.kissintelligentsystems.com/ohm/  Not sure of current state.
St.Ack

Re: HBase integration - DAO vs more "loosely defined" data access

Posted by Jonathan Gray <jl...@streamy.com>.
Tim,

I came to HBase out of the relational world, but definitely not from
DAO or Spring/Hibernate type stuff, so I don't have much direct
advice in that regard.

But very quickly I thought I'd mention how we work with it here...

We wrote our own "query language", which is not really a language
but more an internally used API that we continuously extend as we
use HBase in new ways.

For example, 75% of our storage in HBase follows one of two patterns:

- Small to medium key/val dictionaries (usually between 5 and 25 keys, 
though some values are on the order of 100K)

- Ordered lists (1 to 1M+ entries)

These both map well to HBase.  But our language exposes individual
methods for these two data structures and is agnostic to how they
are stored (pre-0.20 we also use a caching hierarchy, which is part
of the reason for a separate API that fits the caching model).  For
testing purposes, we have even implemented parts of our API on top
of an RDBMS or a simple KV store like BDB or TC (even SQLite).
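
To give a flavour, the two patterns map to an interface along these
lines (illustrative only, not our actual code; all the names here
are made up).  Each backend implements the same interface, which is
what lets us swap in BDB, TC or an RDBMS for testing:

  import java.io.IOException;
  import java.util.List;
  import java.util.Map;

  // Backend-agnostic data API: callers see dictionaries and
  // ordered lists, never tables, families or columns
  public interface DataStore {
    // Small to medium key/val dictionaries (~5 to 25 keys)
    Map<String, byte[]> getDictionary(String entity)
        throws IOException;
    void putDictionaryEntry(String entity, String key, byte[] value)
        throws IOException;

    // Ordered lists (1 to 1M+ entries), read a page at a time
    List<byte[]> getListRange(String listName, long offset, int limit)
        throws IOException;
    void appendToList(String listName, byte[] value)
        throws IOException;
  }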

As we've progressed, we've added extra bits like secondary index
tables, Lucene indexes, etc.  Rather than dealing with those at the
application level, we extend our API with additional calls.  Again,
it's agnostic to the implementation but still exposes a fairly
low-level, bare-bones API.  So in between the database and the
application layer sits a small, lightweight translation between our
internal data API and the backing database's API.
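
So when secondary indexes arrived, the application just got one more
call; whether it is served from an index table or a Lucene index is
hidden in the implementation (again, the names are only
illustrative):

  import java.io.IOException;
  import java.util.List;

  // Index maintenance (secondary index table, Lucene, etc.) lives
  // behind this call, not in application code
  public interface IndexedDataStore extends DataStore {
    // Returns the row keys whose indexed field matches the value
    List<String> findKeys(String entity, String field, byte[] value)
        throws IOException;
  }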

App developers do need to learn something new, but you can add some
structure by exposing just the functionality, data structures and
queries you need, and these can be much better defined than the
HBase API (which is a bit of a blank slate).

JG

