Posted to user@hbase.apache.org by Jean-Daniel Cryans <jd...@apache.org> on 2010/03/01 01:13:50 UTC

Re: Handling Interactive versus Batch Calculations

Answers inline.

J-D

> 1.  What is the mechanism by which you can build your own calculations that
> return results quickly in HBase?  Is it just Java classes or some other
> technique.

Vague question, so I'll say... yes ;) Exactly what kind of processing
are you talking about? Is it the kind that requires joining 2 tables
together on a foreign key and then performing some aggregation? If so,
that won't work with HBase because we expect the tables to be huge
(TBs); the tradeoff is higher scalability for less functionality.

A good example of what HBase is capable of is time series. Let's say
your keys are built from a timestamp followed by a set of tags (machine
name, etc). You want to get the CPU-load data points for machine X from
day M to day N. The good way of doing that in HBase is a short scan (a
scan bounded by a start key and an end key) to take care of the
timestamp range, a filter on the row key to match the machine name
(which could be anywhere in the tags), and another filter to fetch only
the load data points. This will effectively scan maybe a million rows
but will only return a handful of values that you can graph.
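
Something along these lines with the 0.20 Java client (untested sketch;
the table name "metrics", family "data", qualifier "load" and the exact
row-key layout are made up for illustration):

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class LoadScan {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable("metrics");

    // Short scan: bounded by a start key and a stop key, so only the
    // rows for day M to day N are touched (the keys start with the
    // timestamp, so lexicographic order matches time order).
    Scan scan = new Scan(Bytes.toBytes("20100201"),
        Bytes.toBytes("20100301"));

    // Only return the load column instead of whole rows (this plays
    // the role of the second filter mentioned above).
    scan.addColumn(Bytes.toBytes("data"), Bytes.toBytes("load"));

    // Row filter that keeps rows whose key contains the machine name,
    // wherever it appears among the tags.
    scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
        new SubstringComparator("machineX")));

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // Each result is one data point; hand it to your graphing code.
        System.out.println(r);
      }
    } finally {
      scanner.close();
    }
  }
}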

> 2.  For these types of calculations, does HBase handle acquiring the data if
> its distributed across multiple boxes like MapReduce does, or do I have to
> write my own algorithms that seek out the data on the write nodes?

HBase distributes the management of the data across what we call
region servers, which means that data distribution, as well as reading
and writing it, is handled by HBase for you. For a good overview of
the storage mechanism, please see
http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
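
For example, a plain read and write looks like this (untested sketch
against the 0.20 client API, table and column names made up). Note
that nothing in the code says which region server holds the row; the
client library figures that out on its own:

import java.io.IOException;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class GetPutExample {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable("mytable");

    // Write: the client finds the region server hosting the region
    // that contains "row1" and sends the Put there.
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("value"),
        Bytes.toBytes("42"));
    table.put(put);

    // Read: same thing, the region lookup is transparent to the caller.
    Get get = new Get(Bytes.toBytes("row1"));
    Result result = table.get(get);
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("value"))));
  }
}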

> 3.  Is it possible to break-up the work across multiple nodes and then bring
> it together like a MapReduce, but without the performance penalty of using
> the MapReduce framework?  In other words, if HBase knows that files A-D are
> on node 1, E-G are on node 2, can I write a function that says "sum up X on
> node 1 locally and y on node 2 locally" and bring it back to me combined?

Not yet. This is what we call coprocessors, and it's still under
active development. See
https://issues.apache.org/jira/browse/HBASE-2000
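
In the meantime the combining step has to happen on the client (or in
a MapReduce job). A rough, untested sketch of summing a numeric column
with a plain scan, with made-up table and column names:

import java.io.IOException;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientSideSum {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable("mytable");

    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("info"), Bytes.toBytes("x"));
    scan.setCaching(1000);  // fetch rows in batches to cut down on RPCs

    long sum = 0;
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        // Every value still travels to the client before being summed;
        // coprocessors (HBASE-2000) would push this step down to the
        // region servers.
        sum += Bytes.toLong(
            r.getValue(Bytes.toBytes("info"), Bytes.toBytes("x")));
      }
    } finally {
      scanner.close();
    }
    System.out.println("sum = " + sum);
  }
}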

> 4.  Are there ways to guarantee that the computation will happen in-memory
> on the local column store, or is this the only place that such calculations
> happen?

Same answer as 3.

>
> Apologies for what must be very basic questions.  Any pointers really
> appreciated.  Thank you.
>
> Best Regards,
>
> Nenshad
>
> --
> Nenshad D. Bardoliwalla
> Twitter: http://twitter.com/nenshad
> Book: http://www.driventoperform.net
> Blog: http://bardoli.blogspot.com
>