You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hbase.apache.org by Chen Wang <ch...@gmail.com> on 2015/01/13 01:42:00 UTC

Design a datastore maintaining historical view of users.

Hey Guys,
I am seeking advice on design a system that maintains a historical view of
a user's activities in past one year. Each user can have different
activities: email_open, email_click, item_view, add_to_cart, purchase etc.
The query I would like to do is, for example,

Find all customers who browse item A in the past 6 month, and also clicked
an email.
and I would like the query to be done in reasonable time frame. (for
example, within 30 minutes to retrieve 10million such users)

Since we already have HBase cluster in place, HBase becomes my first
choice. So I can have customer_id as the row key, column family be
'Activity', then have certain attributes associated with the column
family,something like:

custer_id, browse:{item_id:12334, timestamp:epoc}

However, It seems that HBase would not be a good choice for supporting the
queries above. Even its possible with scan, it will be super inefficient
due to the size of the data set.

Is my understanding correct and I should resort to other data store.(ES in
my opinion). or has anyone done similar thing with HBase?

Thanks in advance.
Chen

Re: Design a datastore maintaining historical view of users.

Posted by Nick Dimiduk <nd...@gmail.com>.

With an entity-centric data model (i.e. customer_id as row key), you're
looking at a full table scan for every query. 30-minute SLA puts you well
within the realm of a MapReduce/Cascading/Pig/Hive/Tez/Spark job. HBase can
work fine for this, but since you're not really in the low-latency world,
perhaps you'd consider a more analytical storage system (i.e., HDFS +
ORC/Parquet). Of course, if your data is extremely sparse, you'll land back
here at HBase.

You can achieve lower latencies with HBase by pushing query components into
the row key. However, if the queries are truly adhoc, you'll probably want
secondary indices. Apache Phoenix is a great choice if you decide to pursue
this route. ES may also be a reasonable choice here, but it depends on many
other factors, including 'scale' and your philosophy about indices as a
data storage medium.

If time is a frequent component of your query patterns, I recommend you
model is directly in your schema. You'll have more flexibility and better
performance than if you rely on HBase's timestamp for this attribute.

-n

On Mon, Jan 12, 2015 at 4:42 PM, Chen Wang <ch...@gmail.com>
wrote:

> Hey Guys,
> I am seeking advice on design a system that maintains a historical view of
> a user's activities in past one year. Each user can have different
> activities: email_open, email_click, item_view, add_to_cart, purchase etc.
> The query I would like to do is, for example,
>
> Find all customers who browse item A in the past 6 month, and also clicked
> an email.
> and I would like the query to be done in reasonable time frame. (for
> example, within 30 minutes to retrieve 10million such users)
>
> Since we already have HBase cluster in place, HBase becomes my first
> choice. So I can have customer_id as the row key, column family be
> 'Activity', then have certain attributes associated with the column
> family,something like:
>
> custer_id, browse:{item_id:12334, timestamp:epoc}
>
> However, It seems that HBase would not be a good choice for supporting the
> queries above. Even its possible with scan, it will be super inefficient
> due to the size of the data set.
>
> Is my understanding correct and I should resort to other data store.(ES in
> my opinion). or has anyone done similar thing with HBase?
>
> Thanks in advance.
> Chen
>

Re: Design a datastore maintaining historical view of users.

Posted by Alok Singh <al...@gmail.com>.

Have you looked at the phoenix project? http://phoenix.apache.org/
It is an SQL layer on top of hbase, an alternative to map-reduce jobs
when ad hoc/realtime queries are needed.
Your use case seems like it would work fairly well with phoenix.

Alok

On Mon, Jan 12, 2015 at 7:06 PM, Wilm Schumacher
<wi...@gmail.com> wrote:
> Hi,
>
> I'm doing something comparable right now, but not with such a HUGE
> database O_o. 10 Mio results for such a query? This would mean that you
> have 100 Mio -> 1 Billion customers ?!?!
>
> However: in my opinion with such a huge database HBase is a good fit.
> However, your data model should be changed for that purpose.
>
> You could use something like
> <userid>-<item_id> (or the other way around)
> for a row key and
> data:clicktime => <timestamp>
> or whatever for data ....
>
> for the "item table". In this example the "e-mail" table does not have
> to be touched (I assume they are separated)
>
> 1.) Map reduce
>
> Your problem make a good point for map reduce. Just map over the e-mail
> events, when you find an e-mail that fits your question, make a get on
> the user which is connected to the e-mail and the item_id (row key
> design above). If so => emit and use an hbase table as sink. By that you
> would make sure that the user is only processed once (as the e-mail
> could be opened twice or so) and you would have a nice list in a
> separate table which you could use.
>
> For such a huge database the question of 30 min or less then is "just" a
> question of money for hbase/yarn nodes.
>
> 2.) Simple index
>
> I thought of simple indexes, too. However, this only would work properly
> if you would make a table where the item_id would be the primary key for
> the index and the users browse events would make up the column family.
>
> with <ts> the specific timestamp of the action something like
> item_1    user:alice <ts>    user:bob <ts>
> item_2    user:alice <ts>                              user:charles <ts>
>
> (hope the e-mail is rendered okay to explain what I mean)
>
> Then you could simply fetch the <item_id> row and use that to check if
> the e-mail action exist.
>
> However, this index would be a table of <number of items> rows and
> <number of users> qualifiers in one column family. Of course it would be
> sparse, but would make up by above estimate 100 Mio -> 1 Billion
> qualifiers in the main column family. I never ran a table THAT wide.
> Does anyone on this list did something like that and can comment?
>
> Furthermore you would have to iterate over the result ... which has to
> be estimated >> 10 Mio items, which sounds like a pretty bad idea.
>
> 3.) Fancy index
>
> You make up an index for exactly your question from above.
>
> Row keys would be
> <e-mail-action>-<e-mail_id>-<item_id>-<userid>
> where <e-mail-action> is "click", "erase", ....
>
> with data
> email:1 => <rowkey of the email event>
> email:2 => <rowkey of the email event>
> ...
> (as some events could occure more than once)
>
> and
> item:1 => <rowkey of the item browse event>
> item:2 => <rowkey of the item browse event>
> ...
>
> Then you would have your list right away e.g. by setting up a scan for
> the specific range (which is the answer to your specific question in
> this case).
>
> But you would have to make on every e-mail event a scan over the list of
> items the user has browsed (by design above a simple scan in the "item
> table" for the specific user) and fill the index.
>
> Which actually sounds like an even worse idea.
>
> ====
>
> But as I read your e-mail, this example is just one of the query of many
> different queries, right? And for every query you need one such index.
>
> Normally I'm in favour of indexes, but on this case I think you really
> need a clever row key design of your separate tables and the full power
> of map reduce.
>
> However, I saw your analogue postings on the other "nosql" mailing
> lists. As cassandra have a "join" feature in the cql this would be a
> good fit ... at first sight. Furthermore the querying for ElasticSearch
> can easily be used in a way which satisfy your need. However, by design
> (as I understand what c* is doing on joins) cassandra would do A LOT of
> communication for that case (i.e. inefficient), and elastic search will
> probably break down if the master would have to give you 10 Mio results
> back. Thus you would have to iterate A LOT of pages to get your full
> result, which means a lot of JSON parsing and http stuff.
>
> Even if hbase is not the best choice on the first sight, I think it
> could be the best choice nevertheless.
>
> Best wishes,
>
> Wilm
>
> ps: man ... that's a lot of data. Is that you, amazon? ;)
>
> Am 13.01.2015 um 01:42 schrieb Chen Wang:
>> Hey Guys,
>> I am seeking advice on design a system that maintains a historical view of
>> a user's activities in past one year. Each user can have different
>> activities: email_open, email_click, item_view, add_to_cart, purchase etc.
>> The query I would like to do is, for example,
>>
>> Find all customers who browse item A in the past 6 month, and also clicked
>> an email.
>> and I would like the query to be done in reasonable time frame. (for
>> example, within 30 minutes to retrieve 10million such users)
>>
>> Since we already have HBase cluster in place, HBase becomes my first
>> choice. So I can have customer_id as the row key, column family be
>> 'Activity', then have certain attributes associated with the column
>> family,something like:
>>
>> custer_id, browse:{item_id:12334, timestamp:epoc}
>>
>> However, It seems that HBase would not be a good choice for supporting the
>> queries above. Even its possible with scan, it will be super inefficient
>> due to the size of the data set.
>>
>> Is my understanding correct and I should resort to other data store.(ES in
>> my opinion). or has anyone done similar thing with HBase?
>>
>> Thanks in advance.
>> Chen
>>
>

Re: Design a datastore maintaining historical view of users.

Posted by Wilm Schumacher <wi...@gmail.com>.

Hi,

I'm doing something comparable right now, but not with such a HUGE
database O_o. 10 Mio results for such a query? This would mean that you
have 100 Mio -> 1 Billion customers ?!?!

However: in my opinion with such a huge database HBase is a good fit.
However, your data model should be changed for that purpose.

You could use something like
<userid>-<item_id> (or the other way around)
for a row key and
data:clicktime => <timestamp>
or whatever for data ....

for the "item table". In this example the "e-mail" table does not have
to be touched (I assume they are separated)

1.) Map reduce

Your problem make a good point for map reduce. Just map over the e-mail
events, when you find an e-mail that fits your question, make a get on
the user which is connected to the e-mail and the item_id (row key
design above). If so => emit and use an hbase table as sink. By that you
would make sure that the user is only processed once (as the e-mail
could be opened twice or so) and you would have a nice list in a
separate table which you could use.

For such a huge database the question of 30 min or less then is "just" a
question of money for hbase/yarn nodes.

2.) Simple index

I thought of simple indexes, too. However, this only would work properly
if you would make a table where the item_id would be the primary key for
the index and the users browse events would make up the column family.

with <ts> the specific timestamp of the action something like
item_1    user:alice <ts>    user:bob <ts>
item_2    user:alice <ts>                              user:charles <ts>

(hope the e-mail is rendered okay to explain what I mean)

Then you could simply fetch the <item_id> row and use that to check if
the e-mail action exist.

However, this index would be a table of <number of items> rows and
<number of users> qualifiers in one column family. Of course it would be
sparse, but would make up by above estimate 100 Mio -> 1 Billion
qualifiers in the main column family. I never ran a table THAT wide.
Does anyone on this list did something like that and can comment?

Furthermore you would have to iterate over the result ... which has to
be estimated >> 10 Mio items, which sounds like a pretty bad idea.

3.) Fancy index

You make up an index for exactly your question from above.

Row keys would be
<e-mail-action>-<e-mail_id>-<item_id>-<userid>
where <e-mail-action> is "click", "erase", ....

with data
email:1 => <rowkey of the email event>
email:2 => <rowkey of the email event>
...
(as some events could occure more than once)

and
item:1 => <rowkey of the item browse event>
item:2 => <rowkey of the item browse event>
...

Then you would have your list right away e.g. by setting up a scan for
the specific range (which is the answer to your specific question in
this case).

But you would have to make on every e-mail event a scan over the list of
items the user has browsed (by design above a simple scan in the "item
table" for the specific user) and fill the index.

Which actually sounds like an even worse idea.

====

But as I read your e-mail, this example is just one of the query of many
different queries, right? And for every query you need one such index.

Normally I'm in favour of indexes, but on this case I think you really
need a clever row key design of your separate tables and the full power
of map reduce.

However, I saw your analogue postings on the other "nosql" mailing
lists. As cassandra have a "join" feature in the cql this would be a
good fit ... at first sight. Furthermore the querying for ElasticSearch
can easily be used in a way which satisfy your need. However, by design
(as I understand what c* is doing on joins) cassandra would do A LOT of
communication for that case (i.e. inefficient), and elastic search will
probably break down if the master would have to give you 10 Mio results
back. Thus you would have to iterate A LOT of pages to get your full
result, which means a lot of JSON parsing and http stuff.

Even if hbase is not the best choice on the first sight, I think it
could be the best choice nevertheless.

Best wishes,

Wilm

ps: man ... that's a lot of data. Is that you, amazon? ;)

Am 13.01.2015 um 01:42 schrieb Chen Wang:
> Hey Guys,
> I am seeking advice on design a system that maintains a historical view of
> a user's activities in past one year. Each user can have different
> activities: email_open, email_click, item_view, add_to_cart, purchase etc.
> The query I would like to do is, for example,
>
> Find all customers who browse item A in the past 6 month, and also clicked
> an email.
> and I would like the query to be done in reasonable time frame. (for
> example, within 30 minutes to retrieve 10million such users)
>
> Since we already have HBase cluster in place, HBase becomes my first
> choice. So I can have customer_id as the row key, column family be
> 'Activity', then have certain attributes associated with the column
> family,something like:
>
> custer_id, browse:{item_id:12334, timestamp:epoc}
>
> However, It seems that HBase would not be a good choice for supporting the
> queries above. Even its possible with scan, it will be super inefficient
> due to the size of the data set.
>
> Is my understanding correct and I should resort to other data store.(ES in
> my opinion). or has anyone done similar thing with HBase?
>
> Thanks in advance.
> Chen
>