Posted to user@hbase.apache.org by Igor Lautar <ig...@gmail.com> on 2012/02/14 13:48:49 UTC

investigating replacing RDBMS with HBase based solution - splitting daily data inflow?

Hi All,

I'm investigating performance and scalability improvements for one of
our solutions. I'm currently trying to understand whether
HBase (+MapReduce) could provide the scalability needed.

This is the current situation:
 - a daily inflow of about 10 GB of data (20+ million rows)
 - a daily job running over that day's data
 - a monthly job running over that month's data
 - random access to small amounts of data going back over longer
periods (assume a year)
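
For a rough sense of scale, the figures above can be turned into per-row and
per-year numbers. This is only a back-of-the-envelope sketch: it assumes
decimal gigabytes, exactly 20 million rows per day, and a 365-day year, and it
ignores replication and storage overhead. None of those assumptions come from
the original figures.

```python
# Back-of-the-envelope sizing from the stated daily inflow.
# Assumptions (not from the original post): 1 GB = 10**9 bytes,
# exactly 20 million rows per day, 365 days per year, no replication
# or on-disk overhead.
DAILY_BYTES = 10 * 10**9
DAILY_ROWS = 20 * 10**6
DAYS_PER_YEAR = 365

avg_row_bytes = DAILY_BYTES // DAILY_ROWS
yearly_bytes = DAILY_BYTES * DAYS_PER_YEAR
yearly_rows = DAILY_ROWS * DAYS_PER_YEAR

print(f"average row size: {avg_row_bytes} bytes")                 # 500 bytes
print(f"one year of data: {yearly_bytes / 10**12:.2f} TB")        # 3.65 TB
print(f"one year of rows: {yearly_rows / 10**9:.1f} billion")     # 7.3 billion
```

Since the row count is given as "20+ million", 500 bytes is an upper bound on
the average row size under these assumptions.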

Now the HBase questions:
1) How would one approach splitting the data across nodes?
Considering the daily MapReduce job that has to run, would it be best
to separate the data on a daily basis?
Is this possible with a single table, or would it make sense to have one
table per day (or similar)?
I did some investigation on this, and it seems one could implement a custom
getSplits() to map over only the part of the table containing a given day's
data?

The monthly job then just reuses the same data as the daily one, but it has
to go through all days in the month.
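
One way to picture the single-table option: if row keys begin with an ISO
date (an assumption, not something fixed in the question), then the daily and
monthly jobs each reduce to a half-open key range, and only the regions
overlapping that range need to be read. As far as I understand, the stock
TableInputFormat already builds one split per region within the scan's row
range, which would make a custom getSplits() unnecessary here, but that should
be verified against the HBase version in use. The sketch below only emulates
the idea in plain Python; the split points are hypothetical.

```python
from datetime import date, timedelta

def day_range(d):
    """Half-open [start, stop) key range covering one day."""
    return d.isoformat(), (d + timedelta(days=1)).isoformat()

def month_range(year, month):
    """Half-open [start, stop) key range covering one calendar month."""
    start = date(year, month, 1)
    stop = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start.isoformat(), stop.isoformat()

def regions_for(split_points, start, stop):
    """Indices of regions whose key range overlaps [start, stop).

    Region i covers [bounds[i], bounds[i + 1]), where None marks the
    open-ended last region.
    """
    bounds = [""] + sorted(split_points) + [None]
    hits = []
    for i in range(len(bounds) - 1):
        lo, hi = bounds[i], bounds[i + 1]
        if lo < stop and (hi is None or hi > start):
            hits.append(i)
    return hits

# Hypothetical region boundaries of a single table keyed by date.
splits = ["2012-01-16", "2012-02-01", "2012-02-10", "2012-02-20"]

print(regions_for(splits, *day_range(date(2012, 2, 14))))  # [3]
print(regions_for(splits, *month_range(2012, 2)))          # [2, 3, 4]
```

The daily job touches one region in this example and the monthly job three,
without any per-day tables.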

2) Random access case
Is this feasible with HBase at all? There could be something like a
few million random read requests going back a year in time. Note that
a certain amount of latency is not a big issue, as the reads are done for
independent operations.

There are plans to support larger amounts of data. My thinking is that the
first 3 points could scale very well horizontally, but what about random
reads?

Regards,
Igor

Re: investigating replacing RDBMS with HBase based solution - splitting daily data inflow?

Posted by Igor Lautar <ig...@gmail.com>.

Hi,

I looked into this some more and now have a better idea of how it could be
implemented.

As values are looked up by date (and sometimes additionally by source ID),
it would make sense to store each value in a separate row.
The rowkey would be some kind of time series, like:
timestamp_sourceID

However, the docs suggest this is a bad idea, as all inserts go to only one
region at a time (because the rowkeys share the same, ever-increasing
beginning).
I have taken a look at the OpenTSDB schema, where the metric ID (or source ID
in this case) is stored first, followed by the timestamp (albeit at 10m
granularity; they store the exact time details in columns). However, their
scans know the metric ID for which the scan is done (at least this is what I
saw from a quick look at the code - please correct me if I'm wrong), which
ours do not.
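
The hot-spotting concern can be illustrated with a small simulation. This is
a sketch under stated assumptions: the source IDs, timestamps, and region
split points below are all hypothetical, and a sorted list of split points
stands in for the table's regions.

```python
from bisect import bisect_right
from collections import Counter

def region_of(split_points, key):
    """Index of the region holding key, given sorted region start keys."""
    return bisect_right(split_points, key)

# Hypothetical inflow: 4 sources, one value per source per minute for an hour.
sources = ["src1", "src2", "src3", "src4"]
stamps = [f"2012-02-14T13:{m:02d}:00" for m in range(60)]

# Timestamp-first keys, with the table split by date.
ts_splits = ["2012-02-12", "2012-02-13", "2012-02-14"]
ts_first = Counter(
    region_of(ts_splits, f"{t}_{s}") for t in stamps for s in sources
)

# Source-first keys (the OpenTSDB-like ordering), table split by source.
src_splits = ["src2", "src3", "src4"]
src_first = Counter(
    region_of(src_splits, f"{s}_{t}") for t in stamps for s in sources
)

print(dict(ts_first))                   # {3: 240}
print(dict(sorted(src_first.items())))  # {0: 60, 1: 60, 2: 60, 3: 60}
```

With timestamp-first keys every write of the hour lands in the last region;
with source-first keys the same writes spread evenly, at the cost of no longer
being contiguous by time.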

In our case, we want to use HBase's ability to scan on partial keys
to get all rows for a specific day (or year/month).
Assuming the timestamp format is YYYY-MM-DDTHH:MM:SS (ignore the length of
the rowkey for the purposes of discussion), we could scan for
YYYY
YYYY-MM
YYYY-MM-DD
etc.
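
A partial-key scan of this kind comes down to a start row (inclusive) and a
stop row (exclusive). The sketch below shows one way to derive the stop row
from a prefix and emulates the scan over a sorted list of keys, which stands
in for the table. The helper names and the sample keys are hypothetical.

```python
from bisect import bisect_left

def prefix_range(prefix: bytes):
    """Return (start, stop) such that start <= key < stop iff key has prefix.

    stop is the prefix with its last non-0xFF byte incremented and any
    trailing 0xFF bytes dropped; None means "scan to the end of the table".
    """
    stop = bytearray(prefix)
    while stop and stop[-1] == 0xFF:
        stop.pop()
    if not stop:
        return prefix, None
    stop[-1] += 1
    return prefix, bytes(stop)

def scan(sorted_keys, prefix: bytes):
    """Emulate a range scan over a sorted key list."""
    start, stop = prefix_range(prefix)
    lo = bisect_left(sorted_keys, start)
    hi = len(sorted_keys) if stop is None else bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]

# Hypothetical rowkeys in timestamp_sourceID form.
keys = sorted([
    b"2011-12-31T23:59:59_a",
    b"2012-01-05T08:00:00_a",
    b"2012-02-14T13:48:49_a",
    b"2012-02-14T13:48:49_b",
    b"2012-02-15T00:00:00_a",
    b"2013-01-01T00:00:00_a",
])

print(prefix_range(b"2012-02-14"))    # (b'2012-02-14', b'2012-02-15')
print(len(scan(keys, b"2012-02-14"))) # 2
print(len(scan(keys, b"2012-02")))    # 3
print(len(scan(keys, b"2012")))       # 4
```

Only the keys inside the range are visited; rows for other days, months, or
years are never read.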

How can the same scan efficiency be achieved (i.e., without scanning the
whole table and filtering out older/newer timestamps) if the timestamp is not
at the beginning of the rowkey?
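
One approach commonly described for this trade-off is to put a small "salt"
or bucket prefix in front of the timestamp, so writes spread over a fixed
number of buckets while each bucket stays ordered by time. Nothing in this
thread confirms that it suits this workload; the sketch below is only an
illustration. The bucket count, the key layout (BB_timestamp_sourceID), and
all helper names are assumptions.

```python
import zlib
from bisect import bisect_left

NUM_BUCKETS = 4  # hypothetical; would be tuned to the cluster size

def salted_key(timestamp: str, source_id: str) -> str:
    """Row key of the form BB_timestamp_sourceID, BB derived from source_id."""
    bucket = zlib.crc32(source_id.encode()) % NUM_BUCKETS
    return f"{bucket:02d}_{timestamp}_{source_id}"

def scan_ranges(date_prefix: str):
    """One half-open [start, stop) key range per bucket for a date prefix."""
    ranges = []
    for b in range(NUM_BUCKETS):
        start = f"{b:02d}_{date_prefix}"
        # Incrementing the last character gives the first key past the
        # prefix (valid here because date prefixes end in an ASCII digit).
        stop = start[:-1] + chr(ord(start[-1]) + 1)
        ranges.append((start, stop))
    return ranges

def scan_day(sorted_keys, date_prefix: str):
    """Emulate the fan-out: run each bucket's range scan and merge results."""
    rows = []
    for start, stop in scan_ranges(date_prefix):
        lo = bisect_left(sorted_keys, start)
        hi = bisect_left(sorted_keys, stop)
        rows.extend(sorted_keys[lo:hi])
    return rows

# Hypothetical data: 20 sources, one value per source on each of three days.
sources = [f"src{i}" for i in range(20)]
keys = sorted(
    salted_key(f"2012-02-{d:02d}T12:00:00", s)
    for d in (13, 14, 15)
    for s in sources
)

print(len(scan_day(keys, "2012-02-14")))  # 20
print(len(scan_day(keys, "2012-02")))     # 60
```

The cost is that every date query becomes NUM_BUCKETS range scans instead of
one, so the bucket count trades write distribution against read fan-out.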

Regards,
Igor
