Posted to user@hbase.apache.org by jt...@ina.fr on 2009/02/18 19:29:09 UTC

Re : Re: Table design question

> On Wed, Feb 18, 2009 at 2:24 AM, Jérôme Thièvre INA <jt...@ina.fr> wrote:
>
> > Hi,
> >
> > I set up a cluster of 4 machines running hbase.
> > I'm working on a web archiving application that needs to access (randomly)
> > records with requests of the type:
> >
> > Record record = getClosestRecord(url, requestedDate);
> > This method should find the record for the specified url at the *nearest*
> > date to the requestedDate. The requested dates have very little chance of
> > matching an insertion date.
>
>
> (wayback machine?)
> 

A kind of wayback machine, but based on a proxy; we don't rewrite URLs.


> 
> Currently we can only return records at an explicit date or older, not
> newer.
>
>
> > Each record is made of 10 columns, and each insert is of the type:
> >
> > insertRecord(url, date, record);
> >
> > There are several possible designs for my record table:
> >
> > 1. RowKey=url, and all columns are labelled with the same date.
> >
> > 2. RowKey=url, and we use the timestamp and version support of hbase;
> > column names are columnFamily names (no label).
> >
> > 3. RowKey=url+date, and column names are columnFamily names (no label).
>
> Examples please (I've only had one cup of coffee so far this morning).
> 
> 


Suppose the column families are: {'content:', 'type:'}
I want to insert a new record with url www.google.com at date 20090218:

import org.apache.hadoop.hbase.io.BatchUpdate;
import org.apache.hadoop.hbase.util.Bytes;

Case 1: the date is the column label
BatchUpdate update = new BatchUpdate("www.google.com");
update.put("content:20090218", Bytes.toBytes("1ffe36e5b13f28e69c2886f40fd3fcea2ce05d030b508c11d714dead5d69000f"));
update.put("type:20090218", Bytes.toBytes("text/html"));
table.commit(update);

Case 2: implies using hbase versioning (the date is the cell timestamp)
BatchUpdate update = new BatchUpdate("www.google.com", toTimestamp(20090218));
update.put("content:", Bytes.toBytes("1ffe36e5b13f28e69c2886f40fd3fcea2ce05d030b508c11d714dead5d69000f"));
update.put("type:", Bytes.toBytes("text/html"));
table.commit(update);

Case 3: the date is part of the row key
BatchUpdate update = new BatchUpdate("www.google.com@20090218");
update.put("content:", Bytes.toBytes("1ffe36e5b13f28e69c2886f40fd3fcea2ce05d030b508c11d714dead5d69000f"));
update.put("type:", Bytes.toBytes("text/html"));
table.commit(update);
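
Case 2 relies on a toTimestamp helper that is named but never shown in this thread. A minimal sketch of what such a helper might look like on a modern JDK follows; the class name ArchiveDates, the int yyyyMMdd input, and the choice of midnight UTC are all assumptions made for illustration, not the code actually used here.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical helper: the thread only mentions toTimestamp(20090218),
// so the signature and the UTC convention below are assumptions.
final class ArchiveDates {
    // Converts a yyyyMMdd capture date into epoch milliseconds at 00:00 UTC,
    // which could then be passed as the cell timestamp in Case 2.
    static long toTimestamp(int yyyymmdd) {
        LocalDate date = LocalDate.of(
                yyyymmdd / 10000,        // year
                (yyyymmdd / 100) % 100,  // month
                yyyymmdd % 100);         // day of month
        return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(toTimestamp(20090218)); // prints 1234915200000
    }
}
```

Fixing the time zone matters: if inserts and lookups convert dates with different zones, the same capture date maps to different timestamps.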

> 
> >
> > For now, I use method 1. To answer getClosestRecord correctly, this
> > implies loading an entire columnFamily for a specified row,
> > finding the closest date among the columnFamily, and loading the other
> > columns labelled with this closest date.
> > I chose this method because I thought I could use the method
> > HTable.getClosestRowBefore(url, columnFamily:requestedDate) to minimize
> > column loads, but in fact I need the closest row before and the closest row
> > after to determine which one is at the closest date, so I don't use the
> > method getClosestRowBefore.
> >
> > Solution 2 seems to be a good alternative: I could have the same
> > functionality with the same process, but the date would be stored once per
> > row insert (as timestamp) instead of once per column.
>
>
>
> This seems like a better hbase fit.
>
>
>
> >
> >
> > Solution 3 implies only one insert per row key, but dramatically increases
> > the number of rows.
> >
>
> Yeah, but you can scan them quickly.  Good for finding date ranges (until we
> enrich the API and allow get/scan between date ranges).  You'll probably
> have to do as hbase does internally, do a little trick so the newest insert
> shows first -- rather than last.
> 
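
One common way to get the "newest shows first" ordering mentioned above is to store a reversed timestamp in the row key. The sketch below is only an illustration of that idea: the key layout (url, then "@", then a zero-padded Long.MAX_VALUE minus the timestamp) and the class name NewestFirstKeys are assumptions, and a TreeSet stands in for HBase's sorted row keys.

```java
import java.util.TreeSet;

// Illustrative only: key layout and names are assumptions, not taken from HBase.
final class NewestFirstKeys {
    // Builds url + "@" + (Long.MAX_VALUE - timestamp), zero-padded to 19 digits
    // so that lexicographic order matches numeric order for timestamps >= 0.
    static String rowKey(String url, long timestampMillis) {
        return url + "@" + String.format("%019d", Long.MAX_VALUE - timestampMillis);
    }

    public static void main(String[] args) {
        TreeSet<String> rows = new TreeSet<>(); // stand-in for the sorted table
        rows.add(rowKey("www.google.com", 1000L));
        rows.add(rowKey("www.google.com", 3000L));
        rows.add(rowKey("www.google.com", 2000L));
        // The first row in sort order is the one built from the largest timestamp.
        System.out.println(rows.first().equals(rowKey("www.google.com", 3000L)));
    }
}
```

With such keys, a scan started at url + "@" would meet the most recent capture of that url first.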

We can have thousands of different version dates for some URLs.

Is it possible (or will it be) to load column names without loading the cell content? Same question for the timestamp?


> St.Ack
> 
> >
> > What is the best solution to ensure the best random access time?
> >
> > Jérôme Thièvre
> >
> 

Re : Re: Re : Re: Re : Re: Table design question

Posted by jt...@ina.fr.
Sorry, the problem was caused by a bug in my code.

So it works; I can identify the right row.

In fact, my row keys are encoded as reverseHostUrl@date1-date2,
more like these:

www.google.com@200801-200802
www.google.com@200902-200904
www.google.com@201001-201002

To identify the right row for the request www.google.com@200901, I have to access two rows to find the closest date interval:

getClosestRowBefore(www.google.com@200901) => www.google.com@200801-200802
getScanner(www.google.com@200801-200802*).next() => www.google.com@200902-200904

I use this method and it works, but it is really slow, even if I make the same requests several times!
All my columns are block cached.
Do these two methods benefit from block caching?
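
The two-step lookup described in this message can be modelled outside HBase with a sorted set. In this sketch, floor() plays the role of getClosestRowBefore and higher() plays the role of the scanner's next row. It is only a model under stated assumptions: the class ClosestIntervalLookup, its method names, and the month-based distance are hypothetical, and it says nothing about HBase performance or block caching.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical model of the lookup; a TreeSet stands in for the sorted table.
final class ClosestIntervalLookup {
    // Converts yyyyMM into a month count so differences are true month distances.
    static int months(String yyyymm) {
        return Integer.parseInt(yyyymm.substring(0, 4)) * 12
                + Integer.parseInt(yyyymm.substring(4, 6));
    }

    // Months between a date and the date1-date2 interval of a row key (0 if inside).
    static int distance(String rowKey, String yyyymm) {
        String[] range = rowKey.substring(rowKey.lastIndexOf('@') + 1).split("-");
        int d = months(yyyymm);
        int start = months(range[0]);
        int end = months(range[1]);
        if (d < start) return start - d;
        if (d > end) return d - end;
        return 0;
    }

    // Returns the row of this url whose interval is closest to the date, or null.
    static String closest(TreeSet<String> rows, String url, String yyyymm) {
        String prefix = url + "@";
        String probe = prefix + yyyymm;
        String before = rows.floor(probe);  // like getClosestRowBefore
        String after = rows.higher(probe);  // like the scanner's next row
        // Discard neighbours that belong to a different url.
        if (before != null && !before.startsWith(prefix)) before = null;
        if (after != null && !after.startsWith(prefix)) after = null;
        if (before == null) return after;
        if (after == null) return before;
        return distance(before, yyyymm) <= distance(after, yyyymm) ? before : after;
    }

    public static void main(String[] args) {
        TreeSet<String> rows = new TreeSet<>(Arrays.asList(
                "www.google.com@200801-200802",
                "www.google.com@200902-200904",
                "www.google.com@201001-201002"));
        // prints www.google.com@200902-200904
        System.out.println(closest(rows, "www.google.com", "200901"));
    }
}
```

The prefix check matters because the row before or after the probe may belong to a neighbouring url when the requested url has no earlier or later capture.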

Thank you for your time !

Jérôme 


----- Original message -----
From: stack <st...@duboce.net>
Date: Friday, February 27, 2009 7:10 pm
Subject: Re: Re : Re: Re : Re: Table design question

> getClosestRowBefore should work.  What are you supplying for row?  The
> column you ask for exists?
> 
> What happens if you open a scanner at the (non-existent) row 
> 'www.google.com@'?
> 
> St.Ack

Re: Re : Re: Re : Re: Table design question

Posted by stack <st...@duboce.net>.
getClosestRowBefore should work.  What are you supplying for row?  The
column you ask for exists?

What happens if you open a scanner at the (non-existent) row
'www.google.com@'?

St.Ack

On Fri, Feb 27, 2009 at 8:02 AM, <jt...@ina.fr> wrote:

> Hi,
>
> following the discussion with Stack, I have modified the way I insert data
> in hbase.
>
> Now I insert data in an HTable using url@date as the row key,
> like this:
>
> Case 3:
> BatchUpdate update = new BatchUpdate("www.google.com@20090218");
> update.put("content:",
> Bytes.toBytes("1ffe36e5b13f28e69c2886f40fd3fcea2ce05d030b508c11d714dead5d69000f"));
> update.put("type:", Bytes.toBytes("text/html"));
> table.commit(update);
>
> I want to access these rows, but with inexact keys. If I have inserted these
> rows:
>
> www.google.com@200801
> www.google.com@200901
> www.google.com@201001
>
> and make this request:
>
> www.google.com@200902, I would like to find the row with the specified url
> at the closest date to 200902 (www.google.com@200901 in my case).
>
> So I thought I could use the method HTable.getClosestRowBefore(byte[]
> row, byte[] column) to identify a row whose key is less than the
> requested one, and then scan to identify precisely the right row.
>
>
> In fact, this method always returns the row with the null key if I request
> a row that doesn't exactly match an inserted one.
>
> Is there really a way to make this kind of request in hbase?
>
> Jérôme Thièvre

Re : Re: Re : Re: Table design question

Posted by jt...@ina.fr.
Hi,

Following the discussion with Stack, I have modified the way I insert data in hbase.

Now I insert data in an HTable using url@date as the row key,
like this:

Case 3:
BatchUpdate update = new BatchUpdate("www.google.com@20090218");
update.put("content:",
Bytes.toBytes("1ffe36e5b13f28e69c2886f40fd3fcea2ce05d030b508c11d714dead5d69000f"));
update.put("type:", Bytes.toBytes("text/html"));
table.commit(update);

I want to access these rows, but with inexact keys. If I have inserted these rows:

www.google.com@200801
www.google.com@200901
www.google.com@201001

and make this request:

www.google.com@200902, I would like to find the row with the specified url at the closest date to 200902 (www.google.com@200901 in my case).

So I thought I could use the method HTable.getClosestRowBefore(byte[] row, byte[] column) to identify a row whose key is less than the requested one, and then scan to identify precisely the right row.


In fact, this method always returns the row with the null key if I request a row that doesn't exactly match an inserted one.

Is there really a way to make this kind of request in hbase?

Jérôme Thièvre





----- Original message -----
From: stack <st...@duboce.net>
Date: Wednesday, February 18, 2009 10:48 pm
Subject: Re: Re : Re: Table design question


Re: Re : Re: Table design question

Posted by stack <st...@duboce.net>.
On Wed, Feb 18, 2009 at 10:29 AM, <jt...@ina.fr> wrote:

> >
> > Currently we can only return records at an explicit date or older, not
> > newer.
> >
> >
> > > Each record is made of 10 columns, and each insert is of the type:
> > >
> > > insertRecord(url, date, record);
> > >
> > > There are several possible designs for my record table:
> > >
> > > 1. RowKey=url, and all columns are labelled with the same date.
> > >
> > > 2. RowKey=url, and we use the timestamp and version support of hbase;
> > > column names are columnFamily names (no label).
> > >
> > > 3. RowKey=url+date, and column names are columnFamily names (no
> > > label).
> >
> > Examples please (I've only had one cup of coffee so far this morning).
> >
> >
>
>
> Suppose the column families are: {'content:', 'type:'}
> I want to insert a new record with url www.google.com at date 20090218:
>
> Case 1:
> BatchUpdate update = new BatchUpdate("www.google.com");
> update.put("content:20090218",
> Bytes.toBytes("1ffe36e5b13f28e69c2886f40fd3fcea2ce05d030b508c11d714dead5d69000f"));
> update.put("type:20090218", Bytes.toBytes("text/html"));
> table.commit(update);
>
> Case 2: implies using hbase versioning
> BatchUpdate update = new BatchUpdate("www.google.com", toTimestamp(20090218));
> update.put("content:",
> Bytes.toBytes("1ffe36e5b13f28e69c2886f40fd3fcea2ce05d030b508c11d714dead5d69000f"));
> update.put("type:", Bytes.toBytes("text/html"));
> table.commit(update);



I like this schema best.

But both case 1 and 2 will have issues in current hbase if there are
thousands of versions (to be fixed in 0.20.0).  Just a heads up.


>
> Case 3:
> BatchUpdate update = new BatchUpdate("www.google.com@20090218");
> update.put("content:",
> Bytes.toBytes("1ffe36e5b13f28e69c2886f40fd3fcea2ce05d030b508c11d714dead5d69000f"));
> update.put("type:", Bytes.toBytes("text/html"));
> table.commit(update);
>


This will work fine in current hbase, even with thousands of versions.


> Is it possible (or will it be) to load column names without loading the cell
> content? Same question for the timestamp?
>

Cell has to have something in it.

Or do you mean query hbase to find list of columns in a row without
returning data?  If the latter is your question, no, there is no way to get
listing without getting the payload too.

St.Ack