
Posted to user@hbase.apache.org by Phillip Nelson <ph...@gmail.com> on 2010/11/04 06:16:41 UTC

structured data knowledge store in HBase

Hi Guys,

Thanks ahead of time for reading through this and for any feedback you can give. I'm relatively new to HBase, but I'm really enjoying working with it.

I'm working on a project to store a large amount of simple structured data in HBase. Basically, it's a subset of OWL + RDF: each object has a set of types and a set of predicate (property) => value mappings.

My first design is this: a single table, public_objects, with two column families: t: for types and p: for properties.
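
As a rough sketch of that layout (not the actual loader; build_row and its arguments are hypothetical names), the mapping from one object to a row could look like this in Python. It assumes a type is stored as a marker column with value 1, and that a property with several values is stored as the repr of a Python list:

```python
def build_row(types, properties):
    """Map one object to a {column: value} dict using the t: and p: families.

    types      -- iterable of type URIs (hypothetical input shape)
    properties -- dict of predicate URI -> list of already-prefixed values
    """
    row = {}
    for type_uri in types:
        # Type membership is only a marker; the value itself carries no data.
        row["t:o:" + type_uri] = "1"
    for predicate, values in properties.items():
        if len(values) == 1:
            # A single value is stored as-is.
            row["p:" + predicate] = values[0]
        else:
            # Several values are serialized as a Python list literal.
            row["p:" + predicate] = repr(list(values))
    return row
```

The o: and l: prefixes on the values are left to the caller here, since how objects, classes, and literals get tagged is still an open question.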

In the current setup, here's an example of a Wikipedia object (with some formatting):

hbase(main):005:0> get 'public_objects', 'http://dbpedia.org/resource/%21Hero'
COLUMN CELL
p:http://dbpedia.org/ontology/basedOn value=o:http://dbpedia.org/resource/Bible
p:http://dbpedia.org/ontology/musicBy value=['o:http://dbpedia.org/resource/Eddie_DeGarmo', 'o:http://dbpedia.org/resource/Farrell_and_Farrell']
p:http://xmlns.com/foaf/0.1/name value=l:!Hero
t:o:http://dbpedia.org/ontology/Musical value=1
t:o:http://dbpedia.org/ontology/Work value=1

So this is great because I can quickly scan for all musicals (scan 'public_objects', {COLUMNS => 't:o:http://dbpedia.org/ontology/Musical'}). But it's definitely not good enough, so I have a few questions:

1. When I have many-to-one relationships, I serialize the Python list and slap it into the value. I don't think this will be too expensive to match in my mappers, especially if I don't have to deserialize... but is there a better way to do this type of relationship? I also need to differentiate between objects, classes, and literals (hence the hack-ish namespacing of URIs).

2. Ideally I want to be able to do scans for types AND properties, and feed the values into my M/R process... is there a good way to do this? I was thinking of concatenating the type and the property into the p: column name (i.e. p:http://dbpedia.org/ontology/Work_http://xmlns.com/foaf/0.1/name), but this would have to be repeated for each property.

3. Somewhat unrelated to schema design: how do secondary table indexes fit into this? I don't see how they are accessed via the Thrift interface.
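
To make questions 1 and 2 more concrete, here is a minimal Python sketch of the two alternatives. It is only an illustration under assumptions: the function names, the "|" separator, and the short hash suffix are all hypothetical choices, not anything HBase prescribes. The first function stores each value of a multi-valued property in its own column, so nothing has to be deserialized. The second builds the concatenated type + property columns from question 2, which shows how the column count grows with the number of types.

```python
import hashlib


def value_columns(predicate, values):
    """Question 1 alternative: one p: column per value instead of a list.

    The column name is p:<predicate>|<short hash of the value>, so each
    value of the same predicate gets a distinct, stable column name.
    """
    cols = {}
    for value in values:
        # The 8-character hash suffix is an arbitrary, hypothetical choice.
        digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
        cols["p:%s|%s" % (predicate, digest)] = value
    return cols


def typed_property_columns(types, predicate, value):
    """Question 2 idea: repeat the property once per type of the object."""
    cols = {}
    for type_uri in types:
        # Same naming as in the question: p:<type>_<property>.
        cols["p:%s_%s" % (type_uri, predicate)] = value
    return cols
```

For the !Hero row, typed_property_columns would write the foaf name twice (once for Musical, once for Work), which is the repetition I'm worried about.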

Thanks again,
Phillip Nelson