
Posted to user@hbase.apache.org by llpind <so...@hotmail.com> on 2009/08/24 19:37:23 UTC

HBase data model question

Hey,

I'm trying to move a relational model to HBase, and would like some input.

Suppose I have a constant stream of documents coming in, and I'd like to parse
these by a single word.

It makes sense to have this word as my rowkey, but I need a way to handle
duplicate word text.  Kind of like a dictionary.

What is the best way to solve this in HBase?  A timestamp in the row key?  I
need a way to identify each word uniquely.


Thanks.
-- 
View this message in context: http://www.nabble.com/HBase-data-model-question-tp25120285p25120285.html
Sent from the HBase User mailing list archive at Nabble.com.

RE: HBase data model question

Posted by "Hegner, Travis" <TH...@trilliumit.com>.

If you need to access the data from either perspective, then you'll probably have to create two separate tables, one with the indexing as described before, and one that looks like <word><doc id><pos>, so that you could scan per word. This would have to be handled through your application as well.
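A minimal sketch of the two-table idea, in plain Python rather than against a real HBase client. Two dicts stand in for the tables, and a sorted list of keys stands in for HBase's sorted row order. The helper names (doc_key, word_key, index_document, occurrences) and the "\x00" separator are assumptions made for illustration, not something from this thread. The separator is there because the word is variable length: without it, a scan for "lemma" would also pick up "lemmas".

```python
from bisect import bisect_left

SEP = "\x00"  # assumed delimiter between the word and the fixed-width fields

def doc_key(doc_id, pos, word):
    """Key for the per-document table: <doc id><pos><word>."""
    return f"{doc_id:010d}{pos:06d}{word}"

def word_key(word, doc_id, pos):
    """Key for the per-word table: <word><SEP><doc id><pos>."""
    return f"{word}{SEP}{doc_id:010d}{pos:06d}"

def index_document(doc_id, text, doc_table, word_table):
    """The application writes every word occurrence to both tables."""
    for pos, word in enumerate(text.lower().split()):
        doc_table[doc_key(doc_id, pos, word)] = b""
        word_table[word_key(word, doc_id, pos)] = b""

def occurrences(word, word_table):
    """Simulate a scan over all rows whose key starts with <word><SEP>."""
    keys = sorted(word_table)            # HBase keeps rows sorted by key
    start = word + SEP                   # start row, inclusive
    stop = word + chr(ord(SEP) + 1)      # stop row, exclusive
    lo = bisect_left(keys, start)
    hi = bisect_left(keys, stop)
    hits = []
    for key in keys[lo:hi]:
        tail = key[len(start):]
        hits.append((int(tail[:10]), int(tail[10:])))
    return hits

doc_table, word_table = {}, {}
index_document(1, "the lemma and the lemmas", doc_table, word_table)
index_document(2, "another lemma", doc_table, word_table)
lemma_hits = occurrences("lemma", word_table)  # (doc id, position) pairs
```

Each (doc id, position) pair returned for a word can then be used to build a key range in the per-document table when the surrounding words are needed.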

There are some indexing tools available to automate some of this for HBase, but I'm not very versed in how to use them. I believe one is called "IndexTable".

Maybe someone with more experience there could jump in and offer a possible solution?

Travis Hegner
http://www.travishegner.com/

-----Original Message-----
From: llpind [mailto:sonny_heer@hotmail.com]
Sent: Monday, August 24, 2009 4:30 PM
To: hbase-user@hadoop.apache.org
Subject: RE: HBase data model question


Thanks, I think that's a good starting point.  Along the lines I was thinking,
but I couldn't figure out how to get all for a given lemma (not by doc id,
WP).  Looking at scanners again to see if I can pull that off.
--
View this message in context: http://www.nabble.com/HBase-data-model-question-tp25120285p25123069.html
Sent from the HBase User mailing list archive at Nabble.com.


The information contained in this communication is confidential and is intended only for the use of the named recipient.  Unauthorized use, disclosure, or copying is strictly prohibited and may be unlawful.  If you have received this communication in error, you should know that you are bound to confidentiality, and should please immediately notify the sender or our IT Department at  866.459.4599.

RE: HBase data model question

Posted by llpind <so...@hotmail.com>.

Thanks, I think that's a good starting point.  Along the lines I was thinking,
but I couldn't figure out how to get all for a given lemma (not by doc id,
WP).  Looking at scanners again to see if I can pull that off.
-- 
View this message in context: http://www.nabble.com/HBase-data-model-question-tp25120285p25123069.html
Sent from the HBase User mailing list archive at Nabble.com.

RE: HBase data model question

Posted by "Hegner, Travis" <TH...@trilliumit.com>.

Here is an example of what I was thinking... Your row key could be <doc id><position><word>. You would want the doc id and position to be fixed length, so choose the widths wisely based on the maximum document length (number of words) and the number of documents. A single row key might look like:

"0000012345000650lemma"

The 0000012345 is the document ID, 000650 is the position, and lemma is the word itself. With this setup, you could scan for all the words in a given document ID, or for a range of positions within a given document ID. Scanning a range of positions (e.g. 0000012345000649 to 0000012345000651) would allow you to retrieve the words surrounding a given document ID and word position.
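As a rough illustration of the key layout and the range scan, here is a small sketch in plain Python. It does not talk to HBase: a sorted list of strings stands in for the table's sorted row keys, and the names make_row_key, position_prefix and scan are made up for this example. The widths of 10 and 6 digits are taken from the sample key above. One thing to keep in mind is that an HBase scan includes the start row but excludes the stop row, so to get the word at position 651 back as well, the sketch stops at the prefix for position 652.

```python
from bisect import bisect_left

DOC_ID_WIDTH = 10    # digits in the doc id field, as in the sample key
POSITION_WIDTH = 6   # digits in the position field, as in the sample key

def position_prefix(doc_id, position):
    """Zero-padded <doc id><position> prefix; fixed width keeps keys sortable."""
    if not 0 <= doc_id < 10 ** DOC_ID_WIDTH:
        raise ValueError("doc id does not fit in its fixed-width field")
    if not 0 <= position < 10 ** POSITION_WIDTH:
        raise ValueError("position does not fit in its fixed-width field")
    return f"{doc_id:0{DOC_ID_WIDTH}d}{position:0{POSITION_WIDTH}d}"

def make_row_key(doc_id, position, word):
    """Build the full <doc id><position><word> row key."""
    return position_prefix(doc_id, position) + word

def scan(sorted_keys, start_row, stop_row):
    """Return keys in [start_row, stop_row): start inclusive, stop exclusive."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]

# Hypothetical document 12345 with five words at positions 648..652.
words = ["a", "word", "lemma", "here", "again"]
keys = sorted(make_row_key(12345, 648 + i, w) for i, w in enumerate(words))

# Words around position 650: start at 649, stop just before 652.
neighbours = scan(keys, position_prefix(12345, 649), position_prefix(12345, 652))

# Every word of the document: stop at the first key of the next doc id.
whole_doc = scan(keys, position_prefix(12345, 0), position_prefix(12346, 0))
```

Because both numeric fields are zero-padded to a fixed width, the string order of the keys matches the numeric order of (doc id, position), which is what makes the range scan work.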

Just a thought, hope this helps.

Travis Hegner
http://www.travishegner.com/

-----Original Message-----
From: llpind [mailto:sonny_heer@hotmail.com]
Sent: Monday, August 24, 2009 2:37 PM
To: hbase-user@hadoop.apache.org
Subject: RE: HBase data model question

Good points.

Word combination is what I was trying to say.  Say I have a word (lemma),
and need the words before & after it to be queryable by lemma.  Let's call this a
sentence.  So my rowkey will essentially be a sentence (with doc id).  But I
can still have identical rowkeys within a document.  Maybe I'm missing
something... hmm...

--
View this message in context: http://www.nabble.com/HBase-data-model-question-tp25120285p25121358.html
Sent from the HBase User mailing list archive at Nabble.com.


RE: HBase data model question

Posted by llpind <so...@hotmail.com>.

Good points.

Word combination is what I was trying to say.  Say I have a word (lemma),
and need the words before & after it to be queryable by lemma.  Let's call this a
sentence.  So my rowkey will essentially be a sentence (with doc id).  But I
can still have identical rowkeys within a document.  Maybe I'm missing
something... hmm...



-- 
View this message in context: http://www.nabble.com/HBase-data-model-question-tp25120285p25121358.html
Sent from the HBase User mailing list archive at Nabble.com.

RE: HBase data model question

Posted by "Hegner, Travis" <TH...@trilliumit.com>.

What about a combination of document-id, word-position, and word? With the proper combo, all words in a single document would be located near each other.

Travis Hegner
http://www.travishegner.com/

-----Original Message-----
From: llpind [mailto:sonny_heer@hotmail.com]
Sent: Monday, August 24, 2009 1:37 PM
To: hbase-user@hadoop.apache.org
Subject: HBase data model question

Hey,

I'm trying to move a relational model to HBase, and would like some input.

Suppose I have a constant stream of documents coming in, and I'd like to parse
these by a single word.

It makes sense to have this word as my rowkey, but I need a way to handle
duplicate word text.  Kind of like a dictionary.

What is the best way to solve this in HBase?  A timestamp in the row key?  I
need a way to identify each word uniquely.

Thanks.
--
View this message in context: http://www.nabble.com/HBase-data-model-question-tp25120285p25120285.html
Sent from the HBase User mailing list archive at Nabble.com.
