Posted to user@hbase.apache.org by Mark Jarecki <mj...@bigpond.net.au> on 2010/11/22 07:45:54 UTC

Cell versioning/timestamp limits

Hi, 

I'm completely new to HBase and have some questions regarding cell timestamps. 

My questions: Are there practical limitations to the number of versions (timestamps) a cell can have? Can a cell have, say, a million versions? What are the consequences of this many versions for performance and system requirements? Or should composite row keys be used instead as sorted indexes when numbers are this high?

To illustrate my questions, I'm modelling the messages exchanged between any 2 users on our system. The table is called "messages", and the row key is a composite of the two users' ids involved in the message exchange (e.g. "user1:user2"). A column (e.g. "exchanges:message") contains a cell that is regularly updated with the last message between those users. The cell's timestamp is then used in conjunction with Get.setMaxVersions() and Get.setTimeRange() to enable queries such as "Get the messages exchanged between user1 and user2 since 12th October 12:02:02" or "Get the last 25 messages exchanged between user1 and user2" or "Get all messages exchanged between user1 and user2".

messages :  {
	…
	user1:user2 :  {
		exchanges:message : {
			...
			t3: "Not bad",
			t2: "How's it going?",
			t1: "Hello"
		}
	},
	…
} 

Over time, the number of messages exchanged between the 2 users will be substantial - and growing. I'm concerned that cell versioning was NOT intended for this purpose, and that there might be consequences for having, say, a million versions of a cell.

Thanks in advance.

Mark

Re: Cell versioning/timestamp limits

Posted by Lars George <la...@gmail.com>.
I agree with Mark. HBase starts the built-in ZK support on the nodes that are listed in the quorum. That is why it works as Mark says when you add the ejabber host.

What is broken is your job config. For some reason you do not seem to have the right config in your jar, as it tries to connect to localhost. Fix that config and it should work.

One idea: print out the config in your code to see what you get during the MR job run. That may help you verify what is going on.

Note: just because you have an hbase-site.xml on your nodes does NOT mean the MR job picks it up! It must be on the classpath for the MR task!

Lars


Re: Cell versioning/timestamp limits

Posted by Lars George <la...@gmail.com>.
Hi Mark,

First please read this post: http://outerthought.org/blog/417-ot.html

Rest inline below.

On Nov 22, 2010, at 7:45, Mark Jarecki <mj...@bigpond.net.au> wrote:

> Hi, 
> 
> I'm completely new to HBase and have some questions regarding cell timestamps. 
> 
> My questions: Are there practical limitations to the number of versions (timestamps) a cell can have? Can a cell have, say, a million versions? What are the consequences of this many versions for performance and system requirements? Or should composite row keys be used instead as sorted indexes when numbers are this high?

You could use up to Integer.MAX_VALUE versions, so quite a lot :) The issue is that the system needs to search for matches, so the more versions you have, the more it needs to scan. It may also blow out the size of the store file, since all the versions belong to one row and therefore cannot be split.

If you expect many versions or large cell sizes, you may be better off with the composite key approach.
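
To make the composite key idea concrete, here is a minimal sketch of how such a key could be built. It assumes row keys are compared byte by byte as unsigned values, and that user ids never contain the ":" delimiter. The class MessageRowKey and its methods are hypothetical helpers for illustration, not part of the HBase API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper (not an HBase class): builds composite row keys of the
// form "<userA>:<userB>:" followed by an 8-byte big-endian timestamp.
final class MessageRowKey {

    // Assumes the ids do not contain ':' and the timestamp is non-negative,
    // so that byte order of the key matches chronological order.
    static byte[] of(String userA, String userB, long timestampMillis) {
        if (timestampMillis < 0) {
            throw new IllegalArgumentException("timestamp must be non-negative");
        }
        byte[] prefix = (userA + ":" + userB + ":").getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(prefix.length + Long.BYTES)
                .put(prefix)
                .putLong(timestampMillis) // ByteBuffer writes big-endian by default
                .array();
    }

    // Lexicographic comparison on unsigned bytes (requires Java 9+).
    static int compare(byte[] left, byte[] right) {
        return Arrays.compareUnsigned(left, right);
    }

    public static void main(String[] args) {
        byte[] earlier = of("user1", "user2", 1_000L);
        byte[] later = of("user1", "user2", 2_000L);
        System.out.println(compare(earlier, later) < 0); // prints "true"
    }
}
```

With this layout each message becomes its own row, so a long conversation is spread over many rows that sort by time instead of piling up as versions of a single cell.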
 
> To illustrate my questions, I'm modelling the messages exchanged between any 2 users on our system. The table is called "messages", and the row key is a composite of the two users' ids involved in the message exchange (e.g. "user1:user2"). A column (e.g. "exchanges:message") contains a cell that is regularly updated with the last message between those users. The cell's timestamp is then used in conjunction with Get.setMaxVersions() and Get.setTimeRange() to enable queries such as "Get the messages exchanged between user1 and user2 since 12th October 12:02:02" or "Get the last 25 messages exchanged between user1 and user2" or "Get all messages exchanged between user1 and user2".
> 
> messages :  {
> 	…
> 	user1:user2 :  {
> 		exchanges:message : {
> 			...
> 			t3: "Not bad",
> 			t2: "How's it going?",
> 			t1: "Hello"
> 		}
> 	},
> 	…
> } 
> 
> Over time, the number of messages exchanged between the 2 users will be substantial - and growing. I'm concerned that cell versioning was NOT intended for this purpose, and that there might be consequences for having, say, a million versions of a cell.

Yeah, this is not really what you want to solve with versions then. If you add the timestamp to the row key (user1:user2:ts), you can use scans to get the messages between two timestamps, etc., just the same.
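
As a rough illustration of why this works, the sketch below simulates a table sorted by row key with an in-memory TreeMap. It is not the HBase client API: MessageScanSketch, rowKey, scan and latest are made-up names, and the zero-padded decimal timestamp is just one assumed way to keep key order equal to time order. A range scan then corresponds to an inclusive start row and an exclusive stop row.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.TreeMap;

// In-memory stand-in for a table sorted by row key (not the HBase client API).
final class MessageScanSketch {
    private final TreeMap<String, String> rows = new TreeMap<>();

    // Zero-padding to 19 digits keeps string order equal to numeric order
    // for non-negative timestamps.
    static String rowKey(String userA, String userB, long ts) {
        return String.format(Locale.ROOT, "%s:%s:%019d", userA, userB, ts);
    }

    void put(String userA, String userB, long ts, String message) {
        rows.put(rowKey(userA, userB, ts), message);
    }

    // Messages with fromTs <= ts < toTs, oldest first.
    List<String> scan(String userA, String userB, long fromTs, long toTs) {
        return new ArrayList<>(rows.subMap(
                rowKey(userA, userB, fromTs), true,
                rowKey(userA, userB, toTs), false).values());
    }

    // The most recent `limit` messages between the two users, newest first.
    List<String> latest(String userA, String userB, int limit) {
        List<String> result = new ArrayList<>();
        for (String message : rows.subMap(
                rowKey(userA, userB, 0L), true,
                rowKey(userA, userB, Long.MAX_VALUE), true)
                .descendingMap().values()) {
            if (result.size() == limit) {
                break;
            }
            result.add(message);
        }
        return result;
    }

    public static void main(String[] args) {
        MessageScanSketch table = new MessageScanSketch();
        table.put("user1", "user2", 1L, "Hello");
        table.put("user1", "user2", 2L, "How's it going?");
        table.put("user1", "user2", 3L, "Not bad");
        System.out.println(table.scan("user1", "user2", 2L, 4L)); // [How's it going?, Not bad]
        System.out.println(table.latest("user1", "user2", 2));    // [Not bad, How's it going?]
    }
}
```

The in-memory map can walk backwards for the "last 25 messages" case. If the real client only scans forwards, a common alternative is to store Long.MAX_VALUE - ts in the key so that a forward scan returns the newest messages first.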

> 
> 
> Thanks in advance.
> 
> Mark

Lars