Posted to user@hbase.apache.org by Michael Segel <mi...@hotmail.com> on 2011/04/15 22:01:29 UTC

Question about table size limitations...



I have a question about the practical size limitations of a table in HBase.


I was asked ‘how many rows can one reasonably expect HBase
to handle…’ and the person asking didn’t like my “It depends…”
answer. (I’m a consultant, so the answer to every problem has to start with an “It
depends…” caveat. :-) )

In trying to ascertain a practical answer, I’ve created this
hypothetical problem, and hopefully someone with a bit more insight and
knowledge can provide a better answer.
Please note that this is a hypothetical example; any resemblance to a
real-life problem is coincidental.


We have a fleet of petroleum exploration vessels. Each
vessel tows a set of sonar buoys to take measurements of the ocean floor.
Searches can overlap (crisscross patterns).

So our data sets have both a geospatial aspect and a
time-series aspect. The complete data set for a single ocean can be large:
measured in tens of PBs.


There are two known use cases:

- For a given ‘sweep’, process that data set. (A sweep is the data set for
a given ship in a given grid space on a single day, where we know the start
and end times of the sweep.)

- For a given grid_id (geospatial box), process all of the data collected
by all of the sweeps that occurred there (different ships, dates, etc.).


Having said all of that… how much data can we store in a
table? How many rows?


Assume that the data set per time interval per buoy is 1K in
size and that there are going to be billions of these data points in the
database. (And we can store each buoy’s result in a different column of the
row.)
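
For illustration, here is one composite row key layout I have in mind (a
hypothetical sketch; the field names and widths are my guesses, not a given).
It would let the first use case run as a single contiguous scan and the
second as a prefix scan on grid_id:

    // Hypothetical row key: [grid_id][ship_id][yyyymmdd][sweep_start]
    // Field names and widths are assumptions for illustration only.
    import org.apache.hadoop.hbase.util.Bytes;

    public class SweepKey {
        public static byte[] rowKey(int gridId, short shipId,
                                    int yyyymmdd, long sweepStartMs) {
            byte[] key = new byte[4 + 2 + 4 + 8];
            int off = 0;
            off = Bytes.putInt(key, off, gridId);     // geospatial box
            off = Bytes.putShort(key, off, shipId);   // vessel
            off = Bytes.putInt(key, off, yyyymmdd);   // sweep day
            Bytes.putLong(key, off, sweepStartMs);    // sweep start time
            return key;
        }
    }

A scan bounded by the grid_id prefix would cover the second use case, and
adding ship and date narrows it to a single sweep for the first; each buoy’s
1K reading would sit in its own column of the row, as described above.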

What I’d like to have is some sort of formula that we can
use to help determine a realistic size limit before performance falls
apart.   
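
To make that concrete, here is the kind of back-of-envelope calculation I
have in mind (every constant below is a guess that I would want someone to
correct):

    // Back-of-envelope cluster sizing. All constants are assumptions.
    public class Sizing {
        public static void main(String[] args) {
            double rawBytes     = 10e15;  // 10 PB of raw readings
            double replication  = 3.0;    // HDFS replication factor
            double regionBytes  = 10e9;   // ~10 GB per region (tunable)
            double regionsPerRS = 500;    // regions a server can carry
            double regions      = rawBytes / regionBytes;
            double servers      = regions / regionsPerRS;
            System.out.printf("regions: %.0f  region servers: %.0f  HDFS: %.0f PB%n",
                              regions, servers, rawBytes * replication / 1e15);
        }
    }

Even with those generous assumptions, 10 PB of raw readings works out to
about a million regions spread over a couple thousand region servers, which
is exactly the scale I would rather not have to build just to find out.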



There’s more to this, but the idea is to explore HBase’s
capabilities and limitations. We need to know this because we’d like to anticipate problems and design around them, without having to buy and build a 2000-node cluster just to test the solution...

Thx

 

-Mike
PS. JDCryans, does this help explain the problem?



Re: Question about table size limitations...

Posted by Ted Dunning <td...@maprtech.com>.
Michael,

This sounds like an excellent way to organize this data (buoy + time
interval id => sequence of data points). Clearly you will also need an
auxiliary table that maps geolocation => {buoy, time}+.
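
A minimal sketch of what I mean, using the HBase client API (the table
names, column family, and key widths here are invented for illustration):

    // Hypothetical two-table write path; all names are made up.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BuoyWriter {
        static final byte[] DATA = Bytes.toBytes("d");  // one column family

        public static void write(long buoyId, long intervalTs,
                                 int gridId, byte[] reading) throws Exception {
            Configuration conf = HBaseConfiguration.create();

            // Main table: row = buoy + time interval, cell = the ~1K reading.
            HTable readings = new HTable(conf, "readings");
            byte[] mainRow = Bytes.add(Bytes.toBytes(buoyId),
                                       Bytes.toBytes(intervalTs));
            Put p = new Put(mainRow);
            p.add(DATA, Bytes.toBytes("reading"), reading);
            readings.put(p);

            // Index table: row = grid cell, one qualifier per {buoy, time}.
            HTable geoIndex = new HTable(conf, "geo_index");
            Put idx = new Put(Bytes.toBytes(gridId));
            idx.add(DATA, mainRow, new byte[0]);  // points back at main row
            geoIndex.put(idx);

            readings.close();
            geoIndex.close();
        }
    }

Reading a grid cell is then a single-row Get on geo_index followed by
multi-gets against the main table.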

The question (as you point out) is whether HBase is going to be happy to
store so much data. The current state is "it depends", but you are
definitely pushing things.

There may be some interesting technical alternatives that make this much
easier.  I will contact you off-list about these.
