Posted to user@hbase.apache.org by Prosperent <pr...@gmail.com> on 2011/04/12 00:10:19 UTC

hbase architecture question

We're new to HBase, but somewhat familiar with the core concepts associated
with it. We use MySQL now, but have also used Cassandra for portions of our
code. We feel that HBase is a better fit because of its tight integration
with MapReduce and the proven stability of the underlying Hadoop system.

We run an advertising network in which we collect several thousand pieces of
analytical data per second. This obviously scales poorly in MySQL. Our
initial gut feeling is to do something like the following with HBase. Let me
know if we are on the right track.

Aggregate our detailed raw stats into HBase tables that contain all of our
verbose data. From there, we can run MapReduce jobs to create hourly, daily,
monthly, etc. rollups of the data as needed for our different front-end
interfaces. Store it formatted the way we need it, so we don't have to do
any further processing at display time. This would also give us the
flexibility to create new views with new rollup metrics, since we store all
of our raw data and can MapReduce it any way we need.
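For the raw table, one detail worth settling early is the row key. This is not from the thread, just a hedged sketch: a key that leads with a monotonically increasing timestamp sends every write to the same region (a classic HBase hotspot), so leading with the channel id and inverting the timestamp spreads writes across channels and makes recent events sort first. The layout and names here are assumptions, not a prescription:

```java
import java.util.Locale;

// Hypothetical row-key layout for the raw stats table, one row per event.
// Leading with the channel id avoids a single hot region; inverting the
// timestamp (MAX_VALUE - t, zero-padded) makes newer events sort first
// lexicographically within a channel.
public class RawEventKey {
    public static String build(String channelId, long epochMillis, String eventId) {
        long inverted = Long.MAX_VALUE - epochMillis; // newest-first ordering
        return String.format(Locale.ROOT, "%s|%019d|%s", channelId, inverted, eventId);
    }
}
```

The trailing event id keeps two events in the same millisecond from colliding on one row.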

For simple graphs and a more realtime view of simple data like clicks and
impressions, we thought about simply auto-incrementing hourly, daily, and
monthly counters for a user or revenue channel.
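HBase does support atomic counters (incrementColumnValue on HTable, and later the Increment operation), so the bucketed-counter idea maps onto it directly: each event touches one hourly, one daily, and one monthly counter row. A sketch of one possible key scheme; the "u1|h|yyyyMMddHH" layout is invented for illustration:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;

// Hypothetical counter-row keys: one event maps to three bucket rows.
// The client would then issue one atomic increment per key.
public class CounterKeys {
    private static final DateTimeFormatter HOUR  = fmt("yyyyMMddHH");
    private static final DateTimeFormatter DAY   = fmt("yyyyMMdd");
    private static final DateTimeFormatter MONTH = fmt("yyyyMM");

    private static DateTimeFormatter fmt(String pattern) {
        // Bucket boundaries are computed in UTC so all servers agree.
        return DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC);
    }

    public static List<String> forEvent(String userId, long epochMillis) {
        Instant t = Instant.ofEpochMilli(epochMillis);
        return Arrays.asList(
            userId + "|h|" + HOUR.format(t),
            userId + "|d|" + DAY.format(t),
            userId + "|m|" + MONTH.format(t));
    }
}
```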

The other consideration is getting the data into HBase. We were looking at
adding variables to our URLs so we can aggregate the Apache logs from each
of our front-end application servers. Alternatively, we could do the inserts
straight into HBase using PHP and Thrift. I'm guessing the first scenario is
more efficient speed-wise, but again, I may be overlooking other issues.
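The log-scraping option amounts to appending tracking variables to each served URL and parsing them back out of the Apache access log in batch. A minimal sketch of the parsing step, with invented field names ("cid", "uid"):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the log-scraping route: tracking variables ride
// along in the request's query string and are recovered later from the
// logged request path.
public class LogLineParser {
    /** Extracts key=value pairs from the query string of a logged request path. */
    public static Map<String, String> queryVars(String requestPath) {
        Map<String, String> vars = new HashMap<>();
        int q = requestPath.indexOf('?');
        if (q < 0) return vars; // no query string, nothing to extract
        for (String pair : requestPath.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) vars.put(pair.substring(0, eq), pair.substring(eq + 1));
        }
        return vars;
    }
}
```

The trade-off versus PHP+Thrift inserts is latency for throughput: the log route batches naturally but the counters lag by however often the logs are shipped.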

Does this basic data strategy sound solid? Any suggestions or potential
pitfalls? I would love some advice from those more seasoned in handling
large-volume analytical datasets.

Thanks guys

Brian
-- 
View this message in context: http://old.nabble.com/hbase-architecture-question-tp31374398p31374398.html
Sent from the HBase User mailing list archive at Nabble.com.


RE: hbase architecture question

Posted by Prosperent <pr...@gmail.com>.
The plan was to have the MapReduce jobs run on our schedule (hourly, daily,
monthly) and populate these rollups, so we aren't doing any processing on
the data in HBase at query time. When a user requests stats, we just pull
back the already-compiled data from the rollups. It isn't realtime this way,
but we avoid the I/O issues you pointed out.


Lyman Do wrote:
> 
> It depends on how many concurrent users are on the BI front end. If each
> of them fires off an MR job for their BI queries, likely resulting in a
> scan or partial scan on HBase, this may put too much stress on the I/O
> subsystem.
> 
> If you have the data access patterns of your BI users, you may want to
> pre-aggregate some of the data into MySQL in the form of a data mart,
> which is more flexible for slice-and-dice queries. Leave the MR jobs for
> ad hoc and non- or semi-aggregated data analysis.



RE: hbase architecture question

Posted by Lyman Do <ld...@visibletechnologies.com>.
It depends on how many concurrent users are on the BI front end. If each of them fires off an MR job for their BI queries, likely resulting in a scan or partial scan on HBase, this may put too much stress on the I/O subsystem.

If you have the data access patterns of your BI users, you may want to pre-aggregate some of the data into MySQL in the form of a data mart, which is more flexible for slice-and-dice queries. Leave the MR jobs for ad hoc and non- or semi-aggregated data analysis.
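The pre-aggregation suggested here is, in essence, the reduce side of the scheduled rollup job: fold the raw events into per-day totals and load those into the data mart. A toy sketch of that fold, with an invented input shape (arrays of {day, channel} click events):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the pre-aggregation step: fold raw click events
// into daily per-channel totals suitable for loading into a MySQL data
// mart. A MapReduce reducer for the rollup job would emit the same shape.
public class DailyRollup {
    /** rows: arrays of {day, channel}; returns "day|channel" -> click count. */
    public static Map<String, Long> clicksPerDayChannel(String[][] rows) {
        Map<String, Long> totals = new HashMap<>();
        for (String[] row : rows) {
            totals.merge(row[0] + "|" + row[1], 1L, Long::sum);
        }
        return totals;
    }
}
```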



Re: hbase architecture question

Posted by ja...@cox.net.
This is basically what I do, only I use a Java client to aggregate and place the data into HBase. I can process a log with a million rows in a little over 13 seconds. Writing the data to HBase takes around 40 seconds. Then we hit HBase via a thin client, a Spring WS. It seems to work pretty well.

-Pete
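Throughput like Pete describes usually depends on batching writes rather than issuing one round trip per row; the HBase client itself buffers Puts when auto-flush is disabled on HTable. As a standalone sketch (the batch size and generic shape are illustrative, not from the thread), the chunking idea looks like:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of client-side write batching: split the parsed log
// rows into fixed-size chunks so each flush to HBase carries many Puts.
public class PutBatcher {
    public static <T> List<List<T>> chunks(List<T> rows, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += batchSize) {
            // subList is a view; fine here because we only read the batches
            batches.add(rows.subList(i, Math.min(i + batchSize, rows.size())));
        }
        return batches;
    }
}
```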



