You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hbase.apache.org by Charles Kaminski <fr...@yahoo.com> on 2008/02/04 23:44:46 UTC

Evaluating HBase

Hi All,

I am evaluating HBase and I am not sure if our
use-case fits naturally with HBases capabilities. I
would appreciate any help.

We would like to store a large number (billions) of
rows in HBase using a key field to access the values.
We will then need to continually add, update, and
delete rows. This is our master table. What I
describe here naturally fits into what HBase is
designed to do.

Its this next part that Im having trouble finding
documentation for.

We would like to use HBases parallel processing
capabilities to periodically spawn off other temporary
tables when requested. We would like to take the
first table (the master table), go through the key and
field values in its rows. From this, we would like to
create a second table organized differently from the
master table. We would also need to include count,
max, min, and other things specific to the particular
request.

This seems like textbook map-reduce functionality, but
I dont see too much in HBase referencing this kind of
setup. Also there is a reference in HBases 10 minute
startup guide that states [HBase doesnt] need
mapreduce.

I suppose we could use HBase as an input and output to
Hadoop's map reduce functionality. If we did that,
what would guarantee that we were mapping to local
data?

Any help would be greatly appreciated. If you have a
reference to a previous discussion or document I could
read, that would be appreciated as well.

-FA

____________________________________________________________________________________
Never miss a thing. Make Yahoo your home page.
http://www.yahoo.com/r/hs

Re: Evaluating HBase

Posted by Bryan Duxbury <br...@rapleaf.com>.

The actual mapping and reducing will happen locally on whatever host  
is processing the task, but the storage and retrieval of the data  
you'll be acting on may or may not be on another machine.

On Feb 4, 2008, at 4:17 PM, Charles Kaminski wrote:

> Bryan,
>
> Thanks again.
>
> I believe I have it.  I'm also assuming here that
> TableInputFormat and TableOutputFormat reads and
> writes in parallel and locally on each node.  If my
> assumptions here are correct, then we could probably
> start building some prototypes for our case.
>
> Just to finish, I could use the TableMap, and
> TableReduce, but there is no guarantee that the data
> will be processed locally.  Correct (or are these two
> just for resorting)?
>
>
>
> --- Bryan Duxbury <br...@rapleaf.com> wrote:
>
>> You have it exactly right. There's nothing more to
>> it than that. Is
>> there something further you have questions about?
>>
>> -Bryan
>>
>> On Feb 4, 2008, at 3:32 PM, Charles Kaminski wrote:
>>
>>> Hi Bryan,
>>>
>>> Thanks for the thoughtful response.  Could you
>> take a
>>> moment to write a few lines at a high level on how
>> you
>>> would leverage Hadoop and HBase to fit this use
>> case?
>>>
>>> I think I’m reading the following in your
>> response:
>>> 1. Build and maintain the large master table in
>> HBase
>>> 2. Use TableInputFormat to convert HBase data into
>> a
>>> raw format for Hadoop on HDF
>>> 3. Run Map Reduce in Hadoop
>>> 4. Use TableOutputFormat to build the new table
>>>
>>> Do I have that right?
>>>
>>>
>>> --- Bryan Duxbury <br...@rapleaf.com> wrote:
>>>
>>>> This seems like a good fit for HBase in general.
>>>> You're right, it's
>>>> an application for a MapReduce-style processing.
>>>> HBase doesn't need
>>>> MapReduce in the sense that HBase is not built
>>>> dependent upon it.
>>>> However, we are interested in making HBase play
>> well
>>>> with MapReduce,
>>>> and have several handy classes (TableInputFormat,
>>>> TableOutputFormat)
>>>> in HBase for doing that with Hadoop's MapReduce.
>>>>
>>>> In the current version of HBase, you're correct,
>>>> there is no way to
>>>> guarantee that you are mapping over local data.
>> Data
>>>> locality is
>>>> something that we are very interested in, but
>>>> haven't really had the
>>>> time to pursue yet. We're more concerned about
>> the
>>>> general
>>>> reliability and scalability of HBase. We also
>> need
>>>> to have HDFS, the
>>>> underlying distributed file system, support
>>>> locality-awareness, which
>>>> is something it hasn't gotten completely down
>> yet.
>>>>
>>>> I think you should probably give HBase a shot and
>>>> see how it goes.
>>>> We're very, very interested in seeing how HBase
>>>> performs under
>>>> massive loads and datasets.
>>>>
>>>> -Bryan
>>>>
>>>> On Feb 4, 2008, at 2:44 PM, Charles Kaminski
>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> I am evaluating HBase and I am not sure if our
>>>>> use-case fits naturally with HBase’s
>> capabilities.
>>>>  I
>>>>> would appreciate any help.
>>>>>
>>>>> We would like to store a large number (billions)
>>>> of
>>>>> rows in HBase using a key field to access the
>>>> values.
>>>>> We will then need to continually add, update,
>> and
>>>>> delete rows.  This is our master table.  What I
>>>>> describe here naturally fits into what HBase is
>>>>> designed to do.
>>>>>
>>>>> It’s this next part that I’m having trouble
>>>> finding
>>>>> documentation for.
>>>>>
>>>>> We would like to use HBase’s parallel processing
>>>>> capabilities to periodically spawn off other
>>>> temporary
>>>>> tables when requested.  We would like to take
>> the
>>>>> first table (the master table), go through the
>> key
>>>> and
>>>>> field values in its rows.  From this, we would
>>>> like to
>>>>> create a second table organized differently from
>>>> the
>>>>> master table.  We would also need to include
>>>> count,
>>>>> max, min, and other things specific to the
>>>> particular
>>>>> request.
>>>>>
>>>>> This seems like textbook map-reduce
>> functionality,
>>>> but
>>>>> I don’t see too much in HBase referencing this
>>>> kind of
>>>>> setup.  Also there is a reference in HBase’s 10
>>>> minute
>>>>> startup guide that states “[HBase doesn’t] need
>>>>> mapreduce”.
>>>>>
>>>>> I suppose we could use HBase as an input and
>>>> output to
>>>>> Hadoop's map reduce functionality.  If we did
>>>> that,
>>>>> what would guarantee that we were mapping to
>> local
>>>>> data?
>>>>>
>>>>> Any help would be greatly appreciated.  If you
>>>> have a
>>>>> reference to a previous discussion or document I
>>>> could
>>>>> read, that would be appreciated as well.
>>>>>
>>>>> -FA
>
>
>        
> ______________________________________________________________________ 
> ______________
> Looking for last minute shopping deals?
> Find them fast with Yahoo! Search.  http://tools.search.yahoo.com/ 
> newsearch/category.php?category=shopping