Posted to user@hbase.apache.org by Buntu Dev <bu...@gmail.com> on 2015/08/27 20:58:47 UTC

HBase schema design

I'm planning on writing a time series of user action events including user
profile, attributes and product purchase transactions to answer these
questions/queries:

- What are the events leading up to the user's conversion, i.e., purchase?
- What are the different attributes that changed over a given time period?
- What is the LTV of a given user?
- Retrieve the list of attributes set/enabled for a given user at some point
in time.


As a newbie to HBase, I wanted to confirm that a tall table design, i.e., with
row key <userid>_<timestamp>, is _not_ the right design for these reasons:

* scanning for the latest state of a user seems to be an expensive operation,
since not all the columns will be available in the latest event for the user

* constructing a row key always requires the timestamp to be appended if I'm
not using regex filtering

* fetching the user's state at some point in time t1 involves fetching all the
"<userid>*" rows and looking up the row with timestamp <= t1
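
To make the first and third concerns concrete, here is a minimal sketch in
plain Java, with no HBase client involved. A sorted TreeMap stands in for the
table, and the class name, the zero-padded key format, and the sample
attributes are all assumptions made for illustration. It rebuilds a user's
state at time t1 by merging every event with timestamp <= t1, which is the
extra work needed when no single event row carries all the columns.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class StateAtTime {
    // Hypothetical key format: <userid>_<timestamp>, zero-padded to 19 digits
    // so that string order matches numeric order (assumes timestamps >= 0).
    static String rowKey(String userId, long epochMillis) {
        return String.format("%s_%019d", userId, epochMillis);
    }

    // Merge all events for userId with timestamp <= t1, oldest first,
    // so that newer attribute values overwrite older ones.
    static Map<String, String> stateAt(NavigableMap<String, Map<String, String>> table,
                                       String userId, long t1) {
        Map<String, String> state = new TreeMap<>();
        String first = rowKey(userId, 0L);
        String last = rowKey(userId, t1);
        for (Map<String, String> event : table.subMap(first, true, last, true).values()) {
            state.putAll(event);
        }
        return state;
    }

    public static void main(String[] args) {
        // The TreeMap is only a stand-in for a table sorted by row key.
        NavigableMap<String, Map<String, String>> table = new TreeMap<>();
        table.put(rowKey("u1", 1000L), Map.of("plan", "free", "country", "US"));
        table.put(rowKey("u1", 2000L), Map.of("plan", "pro"));
        table.put(rowKey("u1", 3000L), Map.of("country", "DE"));
        System.out.println(stateAt(table, "u1", 2500L)); // {country=US, plan=pro}
    }
}
```

In this sketch the read is bounded by a start and an end key rather than
covering every "<userid>*" row, but it still has to touch all events up to t1.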


Are these valid concerns?

Thanks!

Re: HBase schema design

Posted by Buntu Dev <bu...@gmail.com>.
Thanks Vladimir. In the case where I need the events in chronological order,
do I always need to retrieve all the "<userid>*" rows, or are there any
alternative ways to design the row key?

Thanks again!



On Thu, Aug 27, 2015 at 12:10 PM, Vladimir Rodionov <vl...@gmail.com>
wrote:

> <userid>_<reverse_timestamp> is better (Long.MAX_VALUE - time) - most
> recent events will come first during scan. This will allow you to do
> efficient time range queries by user_id and start and end time.
>
> -Vlad

Re: HBase schema design

Posted by Vladimir Rodionov <vl...@gmail.com>.
<userid>_<reverse_timestamp> is better, where reverse_timestamp is
(Long.MAX_VALUE - time): the most recent events will come first during a scan.
This will also allow you to do efficient time range queries by user_id and
start and end time.
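
A rough sketch of that key layout in plain Java follows. It does not use the
HBase client; a TreeMap plays the role of the sorted table, and the class
name, the string encoding with 19-digit zero padding, and the sample events
are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class ReverseTimestampKeys {
    // <userid>_<Long.MAX_VALUE - time>, zero-padded to 19 digits (the width of
    // Long.MAX_VALUE) so that string order matches numeric order.
    static String rowKey(String userId, long epochMillis) {
        return String.format("%s_%019d", userId, Long.MAX_VALUE - epochMillis);
    }

    // Events for userId with startMillis <= time <= endMillis, newest first.
    static List<String> eventsInRange(NavigableMap<String, String> table,
                                      String userId, long startMillis, long endMillis) {
        String startRow = rowKey(userId, endMillis);   // newest bound sorts first
        String stopRow = rowKey(userId, startMillis);  // oldest bound sorts last
        return new ArrayList<>(table.subMap(startRow, true, stopRow, true).values());
    }

    public static void main(String[] args) {
        // The TreeMap is only a stand-in for a table sorted by row key.
        NavigableMap<String, String> table = new TreeMap<>();
        table.put(rowKey("u1", 1000L), "signup");
        table.put(rowKey("u1", 2000L), "view");
        table.put(rowKey("u1", 3000L), "purchase");
        table.put(rowKey("u2", 1500L), "signup");
        System.out.println(eventsInRange(table, "u1", 1500L, 3000L)); // [purchase, view]
    }
}
```

Note that the start key is built from the end time and the stop key from the
start time, because larger timestamps produce smaller keys. The sketch treats
both bounds as inclusive; an HBase scan normally excludes the stop row, so a
real scan would need to adjust the stop key accordingly.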

-Vlad
