Posted to user@hbase.apache.org by Mark <st...@gmail.com> on 2011/08/13 20:00:39 UTC

Generic Schema Question

Hi all, I'm trying to wrap my head around HBase schema design and I am 
having trouble modeling the following use case:

We store all our user behavior (clicks, searches, page views) in Hadoop 
and we would like to load it into HBase so we can interactively 
"explore" what our users are doing. For example, given an IP address, 
we would like to get back a list of all the searches, page views, 
clicks, etc. that this user has made.

My initial thought for something like this would be to create a table 
"Logs" with a CF "Data" that has the qualifiers "Search", "Click" and 
"View". Each row would use the IP as its key.

Is this along the right lines, or am I missing something? It sure feels 
like I am. Could anyone please explain how I would accomplish what I am 
looking for?

Thanks

RE: Generic Schema Question

Posted by "Buttler, David" <bu...@llnl.gov>.
If you are interested in the most recent 100 transactions, then instead of using System.currentTimeMillis() as part of your key you can use Long.MAX_VALUE - System.currentTimeMillis().  That way new entries sort to the top.  You can then set the start row of your scan to "192.168.1.2", and the first result will be the most recent entry.  Scan 100 rows and you have everything you want.

Dave
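
The reversed-timestamp key described above can be sketched in plain Java. This is only an illustration under stated assumptions: a TreeMap stands in for the HBase table (both keep keys in sorted order), and the names ReverseTimestampKeys, rowKey and newestFirst are made up for the example, not part of the HBase API. The reversed value is zero-padded to 19 digits so that string order always matches numeric order, and the loop stops as soon as keys no longer start with "192.168.1.2/", so an IP with fewer than 100 rows does not spill over into the rows of 192.168.1.20.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ReverseTimestampKeys {
    // Builds "<ip>/<Long.MAX_VALUE - millis>". The 19-digit zero padding keeps
    // lexicographic order identical to numeric order.
    static String rowKey(String ip, long eventMillis) {
        return String.format("%s/%019d", ip, Long.MAX_VALUE - eventMillis);
    }

    // Stand-in for a scan: start at the IP prefix, read forward, and stop
    // after `limit` rows or when the prefix no longer matches.
    static List<String> newestFirst(TreeMap<String, String> table, String ip, int limit) {
        String prefix = ip + "/";
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> e : table.tailMap(prefix, true).entrySet()) {
            if (values.size() == limit || !e.getKey().startsWith(prefix)) break;
            values.add(e.getValue());
        }
        return values;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put(rowKey("192.168.1.2", 1313280451000L), "/foo/bar");
        table.put(rowKey("192.168.1.2", 1313281242000L), "/foo/baz");
        table.put(rowKey("192.168.1.20", 1313281300000L), "/other");
        // The later page view ("/foo/baz") is listed before the earlier one.
        System.out.println(newestFirst(table, "192.168.1.2", 100));
    }
}
```

With a real table the same idea would presumably be a Scan whose start row is the IP prefix, with the client reading results until it has 100 rows.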

-----Original Message-----
From: Mark [mailto:static.void.dev@gmail.com] 
Sent: Saturday, August 13, 2011 5:16 PM
To: user@hbase.apache.org
Subject: Re: Generic Schema Question

Ok so something like this?

row                     cf:qual        value
---------------------------------------------------------------------
192.168.1.2/1313280451  data:page      "/foo/bar"
192.168.1.2/1313280451  data:referrer  "google.com"
192.168.1.2/1313280451  data:session   "f306e5af69b48568323fdc3018e40e7e"
---------------------------------------------------------------------
192.168.1.2/1313281242  data:page      "/foo/baz"
192.168.1.2/1313281242  data:referrer  ""
192.168.1.2/1313281242  data:session   "f306e5af69b48568323fdc3018e40e7e"
....

Will this allow me to query the last 100 rows for IP "192.168.1.2"? If 
so, how? Will it be efficient? Also, would you mind explaining an 
alternative way of accomplishing this? I'm still trying to figure out 
all the possibilities.

Thanks again


On 8/13/11 4:53 PM, Blake Lemoine wrote:
> You need to have the IP address followed by a slash followed by the time as
> the row key, or some other way of getting multiple rows per IP.
> Then you could scan for the IP prefix.  Of course, that's just one possible
> solution.

Re: Generic Schema Question

Posted by Li Pi <li...@cloudera.com>.
You can do a range scan from 192.168.1.2/1313280451 to
192.168.1.2/1313281242.

Do setBatch to 100.
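
A rough sketch of the range scan above, using a TreeMap as a stand-in for the sorted table. The names RangeScanSketch and scan are invented for the illustration. Two details are worth keeping in mind: HBase treats the stop row of a scan as exclusive by default, so the second key in the range above would not itself be returned, and, as far as I know, Scan.setBatch caps the number of cells per Result rather than the number of rows, so a 100-row limit is usually enforced by the client stopping after 100 results (shown here as the limit parameter).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class RangeScanSketch {
    // Returns at most `limit` row keys in [startRow, stopRow), in ascending key order.
    static List<String> scan(TreeMap<String, String> table,
                             String startRow, String stopRow, int limit) {
        List<String> rows = new ArrayList<>();
        for (String key : table.subMap(startRow, true, stopRow, false).keySet()) {
            if (rows.size() == limit) break;
            rows.add(key);
        }
        return rows;
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("192.168.1.2/1313280451", "/foo/bar");
        table.put("192.168.1.2/1313281242", "/foo/baz");
        // The stop row is excluded, so only the first row key is printed.
        System.out.println(scan(table, "192.168.1.2/1313280451", "192.168.1.2/1313281242", 100));
    }
}
```

To include the last row, the stop row needs to sort just after it, for example the IP prefix with its final character bumped by one.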

Alternatively, you can just use the IP alone as the key, and let HBase keep
track of versions. Set the max versions to Integer.MAX_VALUE when creating
the column family, and just do a get of 192.168.1.2 with
Get.setMaxVersions(int maxVersions)
<http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Get.html#setMaxVersions(int)>
with maxVersions = 100.
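
The versions approach can be pictured with a small model of a single cell (one row, one column) that keeps many timestamped values. This is a sketch in plain Java, not HBase client code: VersionedColumn, put and latest are hypothetical names, and latest(100) only mimics what a get limited to 100 versions would hand back, newest first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

final class VersionedColumn {
    // timestamp -> value, ordered newest first.
    private final TreeMap<Long, String> versions =
            new TreeMap<Long, String>(Collections.<Long>reverseOrder());

    // Writing twice with the same timestamp replaces the earlier value.
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Returns up to `maxVersions` values, newest first.
    List<String> latest(int maxVersions) {
        List<String> out = new ArrayList<>();
        for (String value : versions.values()) {
            if (out.size() == maxVersions) break;
            out.add(value);
        }
        return out;
    }

    public static void main(String[] args) {
        VersionedColumn page = new VersionedColumn();
        page.put(1313280451000L, "/foo/bar");
        page.put(1313281242000L, "/foo/baz");
        System.out.println(page.latest(100)); // newest value first
    }
}
```

One trade-off to weigh: with this layout the page, referrer and session of a single event are tied together only by sharing a timestamp, and two events for the same IP with an identical timestamp overwrite each other in the same column.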



Re: Generic Schema Question

Posted by Doug Meil <do...@explorysmedical.com>.
See this section in the HBase book...

11.6.3. Close ResultScanners

There is a snippet there showing how to use a Scan, which is what you'd want
for that.

I just realized that there should be a better Scan example in the Data
Model chapter.  I'll add it.

Doug Meil
Chief Software Architect, Explorys
doug.meil@explorys.com
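
The point of the "Close ResultScanners" section, that a scanner should be closed even when the client stops reading early, can be shown with a stand-in class. Everything here is hypothetical (RowScanner, ScanAndClose, countRows); it is not the snippet from the HBase book, only a sketch of the same habit using try-with-resources, on the assumption that the real scanner is likewise closeable.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a scanner that holds resources until it is closed.
final class RowScanner implements Iterable<String>, AutoCloseable {
    private final List<String> rows;
    private boolean closed = false;

    RowScanner(List<String> rows) { this.rows = rows; }

    @Override public Iterator<String> iterator() { return rows.iterator(); }
    @Override public void close() { closed = true; }
    boolean isClosed() { return closed; }
}

final class ScanAndClose {
    // Reads at most `limit` rows; the scanner is closed on normal exit,
    // on the early break, and if the loop body throws.
    static int countRows(RowScanner scanner, int limit) {
        int seen = 0;
        try (RowScanner s = scanner) {
            for (String row : s) {
                if (seen == limit) break;
                seen++;
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        RowScanner scanner = new RowScanner(Arrays.asList("row1", "row2", "row3"));
        System.out.println(countRows(scanner, 2) + " rows read, closed=" + scanner.isClosed());
    }
}
```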








Re: Generic Schema Question

Posted by Mark <st...@gmail.com>.
Ok so something like this?

row                     cf:qual        value
---------------------------------------------------------------------
192.168.1.2/1313280451  data:page      "/foo/bar"
192.168.1.2/1313280451  data:referrer  "google.com"
192.168.1.2/1313280451  data:session   "f306e5af69b48568323fdc3018e40e7e"
---------------------------------------------------------------------
192.168.1.2/1313281242  data:page      "/foo/baz"
192.168.1.2/1313281242  data:referrer  ""
192.168.1.2/1313281242  data:session   "f306e5af69b48568323fdc3018e40e7e"
....

Will this allow me to query the last 100 rows for IP "192.168.1.2"? If 
so, how? Will it be efficient? Also, would you mind explaining an 
alternative way of accomplishing this? I'm still trying to figure out 
all the possibilities.

Thanks again



Re: Generic Schema Question

Posted by Blake Lemoine <ba...@gmail.com>.
You need to have the IP address followed by a slash followed by the time as
the row key, or some other way of getting multiple rows per IP.
Then you could scan for the IP prefix.  Of course, that's just one possible
solution.
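
A minimal sketch of the "IP, slash, time" key and the prefix scan, again with a TreeMap standing in for the sorted table. PrefixScanSketch, rowKey, stopRowFor and scanPrefix are names invented for the example, and it assumes plain ASCII keys with the time stored as epoch seconds padded to 10 digits. The stop row is derived by bumping the last character of the prefix, so "192.168.1.2/" becomes "192.168.1.20", which sorts after every row of 192.168.1.2 and before any row of 192.168.1.20.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class PrefixScanSketch {
    // "<ip>/<epochSeconds>", zero-padded so string order equals time order.
    static String rowKey(String ip, long epochSeconds) {
        return String.format("%s/%010d", ip, epochSeconds);
    }

    // Smallest key that sorts after every key starting with `prefix`
    // (assumes the last character is not already the maximum char value).
    static String stopRowFor(String prefix) {
        int last = prefix.length() - 1;
        return prefix.substring(0, last) + (char) (prefix.charAt(last) + 1);
    }

    // All row keys for one IP, oldest first.
    static List<String> scanPrefix(TreeMap<String, String> table, String ip) {
        String start = ip + "/";
        return new ArrayList<>(table.subMap(start, true, stopRowFor(start), false).keySet());
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put(rowKey("192.168.1.2", 1313280451L), "/foo/bar");
        table.put(rowKey("192.168.1.2", 1313281242L), "/foo/baz");
        table.put(rowKey("192.168.1.20", 1313280000L), "/other");
        // Only the two rows of 192.168.1.2 are printed.
        System.out.println(scanPrefix(table, "192.168.1.2"));
    }
}
```

Because plain timestamps sort oldest first, the most recent rows for an IP sit at the end of its range with this key.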