Posted to user@hbase.apache.org by tim robertson <ti...@gmail.com> on 2009/08/19 17:53:19 UTC

Many 2 one in a row - modeling options

Hi all,

I have just started a project to research the migration of a
biodiversity occurrence index (plant / animal specimens collected or
observed) from MySQL to HBase.

We have source records that inherently contain a many-to-one
relationship.  Think of "Scientist A identified this as a Felis
concolor concolor" but 25 years later "Scientist B identified the
same preserved specimen as a Puma concolor".  Each scientific
identification carries several more attributes, and there will always
be one or more of them (possibly tens) for the same specimen.
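
To make the shape concrete, here is roughly what one record looks like
as plain Java objects (a sketch only - the field names are
illustrative, not our schema):

import java.util.List;

// Illustrative sketch of the source record shape, not a final schema.
class Identification {
  String scientificName;  // e.g. "Puma concolor"
  String identifiedBy;    // e.g. "Scientist B"
  String dateIdentified;
}

class OccurrenceRecord {
  String guid;  // the unique record identifier
  // ... other, mostly static attributes (institution code, catalogue number, etc.)
  List<Identification> identifications;  // always 1..n, possibly tens
}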

I am pondering how to model this in HBase and see a few obvious options:
- serializing the scientific identification "List" as bytes
- expanding the record into 2 or more rows, indicating that the rows
were derived from the same source
- expanding the identifications into new column families
- expanding the identification fields into multiple columns in the same family
- considering more than one table

All of the above have pros and cons with respect to client code
complexity and performance.

I have put up a very simple example record at
http://code.google.com/p/biodiversity/wiki/HBaseSchema and would
welcome any comments on this list or on the wiki directly.

Please note that I have only just started the project, so the
documentation is only just taking shape, but this will be a case study
of a migration from MySQL which might be of interest to others.

Thanks,

Tim

Re: Many 2 one in a row - modeling options

Posted by tim robertson <ti...@gmail.com>.
Thanks JG for taking the time to digest that and comment.

It was a hastily written page as I am on vacation... I'm heading over
to the Bay Area for a few days and wanted to start getting something
together to discuss with Stack and anyone else free on Tuesday the
8th.  I hope it develops into a good case study for the HBase community.

In terms of operations, I think it will boil down to three things:

a) full scans building search indexes (probably Lucene)

b) scanning and annotating the same row (such as a field naming the
national park a point falls in).  For the scientific identifications
there will be a fair amount of scanning each identification and then
annotating a new column with an ID from the equivalent of an external
taxonomy.  One example might be that you want to browse the occurrence
store (specimens and observations) using the Catalogue of Life
taxonomy, and this record would be found with the identifier of
http://www.catalogueoflife.org/annual-checklist/2008/show_species_details.php?record_id=5204463.
It is not as simple as a name match, as the synonymy is subject to
differing opinion.

c) scans with filters.  I expect that most of these raw values will be
parsed out to better-typed values (hash-encoded lat/long etc.) and the
filters would be on those families rather than these "raw" families.  I
think the identifications would be parsed to IDs of well-known
taxonomies, and the filters would use those values (see the sketch
after this list).
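
To make (b) and (c) concrete, here is a rough sketch against the new
0.20 client API; the table, family and column names are invented for
illustration, and the Catalogue of Life lookup is hypothetical:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class AnnotateAndFilter {
  static final byte[] TAXON = Bytes.toBytes("taxon");    // invented family
  static final byte[] COL_ID = Bytes.toBytes("colId");   // invented column

  public static void main(String[] args) throws Exception {
    HTable table = new HTable(new HBaseConfiguration(), "occurrence"); // invented table

    // (b) scan every row and annotate it with an external taxonomy ID
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("ids"));
    ResultScanner scanner = table.getScanner(scan);
    for (Result row : scanner) {
      byte[] colId = lookupCatalogueOfLifeId(row);   // hypothetical synonymy matching
      Put annotation = new Put(row.getRow());
      annotation.add(TAXON, COL_ID, colId);
      table.put(annotation);
    }
    scanner.close();

    // (c) later scans filter on the typed, annotated column, not the raw values
    Scan filtered = new Scan();
    filtered.setFilter(new SingleColumnValueFilter(
        TAXON, COL_ID, CompareOp.EQUAL, Bytes.toBytes("5204463")));
    ResultScanner matches = table.getScanner(filtered);
    for (Result match : matches) {
      // ... each match carries the Catalogue of Life identifier above
    }
    matches.close();
  }

  // Placeholder for resolving an identification to a Catalogue of Life ID.
  static byte[] lookupCatalogueOfLifeId(Result row) {
    return Bytes.toBytes("5204463");
  }
}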

I was expecting serializing to be the most likely choice, and I'll
start checking out the protobufs stuff - I have recently been writing
my own serializations based on Writable for storing values in Lucene
indexes.
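
For reference, the hand-rolled Writable approach I mean looks
something like this (a sketch; the fields are illustrative):

import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;

// A sketch: one scientific identification as a hand-rolled Writable.
public class IdentificationWritable implements Writable {
  private String scientificName;
  private String identifiedBy;

  public void write(DataOutput out) throws IOException {
    WritableUtils.writeString(out, scientificName);
    WritableUtils.writeString(out, identifiedBy);
  }

  public void readFields(DataInput in) throws IOException {
    scientificName = WritableUtils.readString(in);
    identifiedBy = WritableUtils.readString(in);
  }

  // The byte[] that would be stored as a cell value (or in a Lucene field).
  public byte[] toBytes() throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    write(new DataOutputStream(out));
    return out.toByteArray();
  }
}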

I'll clean up the wiki and probably have more questions,

Cheers,

Tim



On Wed, Aug 19, 2009 at 7:20 PM, Jonathan Gray<jl...@streamy.com> wrote:
> [...]

Re: Many 2 one in a row - modeling options

Posted by Jonathan Gray <jl...@streamy.com>.
Tim,

Very cool wiki page.  Unfortunately I'm a little confused about exactly 
what the requirements are.

Does each species (and the combination of all of its identifications) 
actually have a single, unique ID?

The most important thing when designing your HBase schema is to 
understand how you want to query it.  And I'm not exactly sure I follow 
that part.

I'm going to assume that there is a single, relatively static set of 
attributes for each unique ID (the GUID, Cat#, etc.).  Let's put those in 
a family, call it "attributes".  You would use that family as a 
key/value dictionary.  The qualifier would be the attribute name, and 
the value would be the attribute value (e.g. attributes:InstCode with 
value MNHA).

The row key, in this case, would be the GUID or whatever unique ID you 
want to look up by.
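
In code, with the 0.20 client API, that looks something like this (the
table name and the Cat# value are invented for the example):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AttributesExample {
  public static void main(String[] args) throws IOException {
    // Row key = the GUID; "attributes" is used as a key/value dictionary.
    HTable table = new HTable(new HBaseConfiguration(), "occurrence");
    Put put = new Put(Bytes.toBytes("some-guid"));
    put.add(Bytes.toBytes("attributes"), Bytes.toBytes("InstCode"), Bytes.toBytes("MNHA"));
    put.add(Bytes.toBytes("attributes"), Bytes.toBytes("CatNum"), Bytes.toBytes("12345"));
    table.put(put);
  }
}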

Now the other part: storing the identifications.  I would definitely 
vote against multiple rows, multiple tables, and multiple families.  As 
you point out, multiple tables would require joining, multiple families 
does in fact mean two separate files, and multiple rows adds a great deal 
of complexity (you need to Scan now, and cannot rely on Get).

So let's say we have a family "identifications" (though you may want to 
shorten these family names, as they are actually stored explicitly with 
every single cell... maybe "ids").  For each identification, you would 
have a single column.  The qualifier of that column would be whatever 
the unique identifier is for that identification; or, if there isn't one, 
you could just wrap the entire thing up into a serialized type and use 
that as the qualifier.  If you have an ID, then I would serialize the 
identification into the value.
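
A sketch of both the write and the read side (the qualifiers and the
serialize() stand-in are invented; you'd use protobufs or a Writable
there):

import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class IdentificationsExample {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "occurrence");
    byte[] row = Bytes.toBytes("some-guid");
    byte[] ids = Bytes.toBytes("ids");  // short family name, stored with every cell

    // One column per identification: qualifier = its ID, value = serialized form.
    Put put = new Put(row);
    put.add(ids, Bytes.toBytes("ident-1"), serialize("Felis concolor concolor"));
    put.add(ids, Bytes.toBytes("ident-2"), serialize("Puma concolor"));
    table.put(put);

    // A single Get returns every identification for the specimen.
    Result result = table.get(new Get(row));
    for (Map.Entry<byte[], byte[]> e : result.getFamilyMap(ids).entrySet()) {
      byte[] value = e.getValue();  // deserialize the identification here
    }
  }

  // Stand-in for real protobuf/Writable serialization of an identification.
  static byte[] serialize(String scientificName) {
    return Bytes.toBytes(scientificName);
  }
}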

You point out that this would have poor scanning performance because of 
the need for deserialization, but I don't necessarily agree.  That can 
be quite fast, depending on implementation, and there's a great deal of 
serialization/deserialization being done behind the scenes to even get 
the data to you in the first place.

Something like protobufs has very efficient and fast 
serialize/deserialize operations.  Java serialization is inefficient in 
space and can be slow, which is why HBase and Hadoop implement the 
Writable interface and provide a minimal/efficient/binary serialization.

I do think that is by far the best approach here; the 
serialization/deserialization should be orders of magnitude faster than 
the round-trip network latency.

I didn't realize your first bullet was what it was; I thought you were 
talking about serializing the entire thing into one column.  Looking 
again, it seems you're on the right track, and that would be the simplest 
and fastest approach.

Keep us updated!

JG



tim robertson wrote:
> [...]