Posted to user@hbase.apache.org by Charles Kaminski <fr...@yahoo.com> on 2008/02/04 23:44:46 UTC

Evaluating HBase

Hi All,

I am evaluating HBase and I am not sure if our
use-case fits naturally with HBase’s capabilities.  I
would appreciate any help.

We would like to store a large number (billions) of
rows in HBase using a key field to access the values. 
We will then need to continually add, update, and
delete rows.  This is our master table.  What I
describe here naturally fits into what HBase is
designed to do.

It’s this next part that I’m having trouble finding
documentation for.

We would like to use HBase’s parallel processing
capabilities to periodically spawn off other temporary
tables when requested.  We would like to take the
first table (the master table), go through the key and
field values in its rows.  From this, we would like to
create a second table organized differently from the
master table.  We would also need to include count,
max, min, and other things specific to the particular
request. 

This seems like textbook map-reduce functionality, but
I don’t see too much in HBase referencing this kind of
setup.  Also, there is a reference in HBase’s 10-minute
startup guide that states “[HBase doesn’t] need
mapreduce”.

I suppose we could use HBase as an input and output to
Hadoop's map reduce functionality.  If we did that,
what would guarantee that we were mapping to local
data?

Any help would be greatly appreciated.  If you have a
reference to a previous discussion or document I could
read, that would be appreciated as well.

-FA




Re: Evaluating HBase

Posted by Bryan Duxbury <br...@rapleaf.com>.
The actual mapping and reducing will happen locally on whatever host  
is processing the task, but the storage and retrieval of the data  
you'll be acting on may or may not be on another machine.

On Feb 4, 2008, at 4:17 PM, Charles Kaminski wrote:

> Bryan,
>
> Thanks again.
>
> I believe I have it.  I'm also assuming here that
> TableInputFormat and TableOutputFormat read and
> write in parallel and locally on each node.  If my
> assumptions here are correct, then we could probably
> start building some prototypes for our case.
>
> Just to finish: I could use TableMap and
> TableReduce, but there is no guarantee that the data
> will be processed locally.  Correct (or are these two
> just for resorting)?
>
>
>
> --- Bryan Duxbury <br...@rapleaf.com> wrote:
>
>> You have it exactly right. There's nothing more to it
>> than that. Is there something further you have
>> questions about?
>>
>> -Bryan


Re: Evaluating HBase 3

Posted by Bryan Duxbury <br...@rapleaf.com>.
You say "selected a single record based on row in where clause". Are  
you working in the shell?

-Bryan

On Feb 7, 2008, at 4:03 PM, Charles Kaminski wrote:

> Hi All,
>
> We're running into severe performance issues.  I'm
> hoping that there is something simple we can do to
> resolve the issues.  Any help would be appreciated.
>
> Here's what we did:
> 1. Loaded 1,000 records into a table with only two
> columns - row and content:.  Row data is 12 bytes and
> content: data is 23 bytes long.
> 2. Using HBase, selected a single record based on row
> in the where clause.  Did this for a few different
> records.  Performance was consistently 0.01 seconds as
> reported by HBase.
> 3. Loaded 1,000,000 records into the same table.  This
> took 248 seconds using random row values.
> 4. Ran the exact same select statements again as in
> step 2.  These consistently took 2 to 3 seconds to
> return a single record.
>
> 2 to 3 seconds to return a single record using a key
> value suggests a major issue with our setup.  I'm
> hoping you agree and can point us to something we're
> doing wrong.


Re: Evaluating HBase 3

Posted by Charles Kaminski <fr...@yahoo.com>.
St.Ack and Bryan,

Turns out it was inconsistent testing on our part.
When we tested with HBase Shell on the server and got
similar results, we thought we were ruling out any
issues with machines connecting to the cluster.

The posts questioning HBase Shell as a good test
prompted us to go back and take a more in-depth review.

Thanks again!

--- stack <st...@duboce.net> wrote:

> Let's try and figure out what's going on, Charles.
>
> The figures at the end of this page have us random-reading
> bigger values out of a table of 1M rows at somewhere
> between 150 and 300 rows a second, depending on the
> HBase version.  (What's your version?)
>
> Want to send us the code your Java apps are using to
> access HBase so we can check it out?
> 
> Thanks,
> St.Ack

Re: Evaluating HBase 3

Posted by stack <st...@duboce.net>.
Let's try and figure out what's going on, Charles.

The figures at the end of this page have us random-reading bigger values
out of a table of 1M rows at somewhere between 150 and 300 rows a
second, depending on the HBase version.  (What's your version?)

Want to send us the code your Java apps are using to access HBase so we
can check it out?

Thanks,
St.Ack


Charles Kaminski wrote:
> Hi St.Ack,
>
> Thanks for the response.  The performance changes
> below are consistent with what we find in our Java
> app.  We used HBase Shell directly on the server to
> rule out anything we might be doing wrong.
>


Re: Evaluating HBase 3

Posted by Charles Kaminski <fr...@yahoo.com>.
Hi St.Ack,

Thanks for the response.  The performance changes
below are consistent with what we find in our Java
app.  We used HBase Shell directly on the server to
rule out anything we might be doing wrong.


--- stack <st...@duboce.net> wrote:

> You are using the shell to do your fetching?  Try
> writing a little Java program.
> St.Ack


Re: Evaluating HBase 3

Posted by stack <st...@duboce.net>.
You are using the shell to do your fetching?  Try writing a little Java
program.
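
For instance, a minimal fetch-timing program might look like the
following.  This is only a rough sketch against the 0.1-era client
API: the table name "mytable" is a placeholder for yours, and class
and method signatures may differ in your hbase version.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HTable;
    import org.apache.hadoop.io.Text;

    public class FetchTest {
      public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml from the classpath to find the cluster.
        HTable table = new HTable(new HBaseConfiguration(), new Text("mytable"));
        for (String row : args) {
          long start = System.currentTimeMillis();
          // Fetch the content: cell for this row key.
          byte[] value = table.get(new Text(row), new Text("content:"));
          long millis = System.currentTimeMillis() - start;
          System.out.println(row + " -> " + new String(value) + " (" + millis + "ms)");
        }
      }
    }

Timing a handful of rows that way takes the shell's own startup and
parsing out of the measurement.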
St.Ack


Charles Kaminski wrote:
> Hi All,
>
> We're running into severe performance issues.  I'm
> hoping that there is something simple we can do to
> resolve the issues.  Any help would be appreciated.
>
> Here's what we did:
> 1. Loaded 1,000 records into a table with only two
> columns - row and content:.  Row data is 12 bytes and
> content: data is 23 bytes long.
> 2. Using HBase, selected a single record based on row
> in the where clause.  Did this for a few different
> records.  Performance was consistently 0.01 seconds as
> reported by HBase.
> 3. Loaded 1,000,000 records into the same table.  This
> took 248 seconds using random row values.
> 4. Ran the exact same select statements again as in
> step 2.  These consistently took 2 to 3 seconds to
> return a single record.
>
> 2 to 3 seconds to return a single record using a key
> value suggests a major issue with our setup.  I'm
> hoping you agree and can point us to something we're
> doing wrong.


Evaluating HBase 3

Posted by Charles Kaminski <fr...@yahoo.com>.
Hi All,

We're running into severe performance issues.  I'm
hoping that there is something simple we can do to
resolve the issues.  Any help would be appreciated.

Here's what we did:
1. Loaded 1,000 records into a table with only two
columns - row and content:.  Row data is 12 bytes and
content: data is 23 bytes long.
2. Using HBase, selected a single record based on row
in the where clause.  Did this for a few different
records.  Performance was consistently 0.01 seconds as
reported by HBase.
3. Loaded 1,000,000 records into the same table.  This
took 248 seconds using random row values.
4. Ran the exact same select statements again as in
step 2.  These consistently took 2 to 3 seconds to
return a single record.

2 to 3 seconds to return a single record using a key
value suggests a major issue with our setup.  I'm
hoping you agree and can point us to something we're
doing wrong.
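
For reference, the load in steps 1 and 3 amounts to something like the
sketch below.  The startUpdate/put/commit calls are the 0.1-era write
API as we understand it and may differ in other versions; "mytable"
stands in for our table name.

    import java.util.Random;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HTable;
    import org.apache.hadoop.io.Text;

    public class LoadTest {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(new HBaseConfiguration(), new Text("mytable"));
        Random random = new Random();
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1000000; i++) {
          // 12-byte random row key and 23 bytes in content:, as in the steps above.
          String row = String.format("%012d", (random.nextLong() >>> 1) % 1000000000000L);
          long lockid = table.startUpdate(new Text(row));
          table.put(lockid, new Text("content:"), "23-bytes-of-content-xyz".getBytes());
          table.commit(lockid);
        }
        System.out.println("load took " + (System.currentTimeMillis() - start) / 1000 + "s");
      }
    }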







Re: Evaluating HBase - 2

Posted by stack <st...@duboce.net>.
See http://wiki.apache.org/hadoop/Hbase/FAQ#1
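
The short version, against the 0.1-era client API, looks something like
this.  The master address and table name below are placeholders, an
hbase-site.xml on the classpath works in place of the explicit set(),
and signatures may differ in your version.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HTable;
    import org.apache.hadoop.io.Text;

    public class HBaseHello {
      public static void main(String[] args) throws Exception {
        HBaseConfiguration conf = new HBaseConfiguration();
        conf.set("hbase.master", "masterhost:60000");  // where your master runs

        HTable table = new HTable(conf, new Text("mytable"));

        // Write one cell...
        long lockid = table.startUpdate(new Text("row1"));
        table.put(lockid, new Text("content:"), "hello".getBytes());
        table.commit(lockid);

        // ...and read it back.
        byte[] value = table.get(new Text("row1"), new Text("content:"));
        System.out.println(new String(value));
      }
    }
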
St.Ack

Charles Kaminski wrote:
> Hi All,
>
> Are there any code or eclipse project examples out
> there of connecting to an hbase cluster and
> manipulating data?


Evaluating HBase - 2

Posted by Charles Kaminski <fr...@yahoo.com>.
Hi All,

Are there any code or eclipse project examples out
there of connecting to an hbase cluster and
manipulating data?





Re: Evaluating HBase

Posted by Charles Kaminski <fr...@yahoo.com>.
Bryan,

Thanks again.

I believe I have it.  I'm also assuming here that
TableInputFormat and TableOutputFormat read and
write in parallel and locally on each node.  If my
assumptions here are correct, then we could probably
start building some prototypes for our case.

Just to finish: I could use TableMap and
TableReduce, but there is no guarantee that the data
will be processed locally.  Correct (or are these two
just for resorting)?



--- Bryan Duxbury <br...@rapleaf.com> wrote:

> You have it exactly right. There's nothing more to it
> than that. Is there something further you have
> questions about?
> 
> -Bryan

Re: Evaluating HBase

Posted by Bryan Duxbury <br...@rapleaf.com>.
You have it exactly right. There's nothing more to it than that. Is  
there something further you have questions about?
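
In outline, the four steps you list below wire up into a single Hadoop
job along these lines.  This is only a sketch: MyMap and MyReduce stand
for your own mapper and reducer classes (not shown), and how the input
table, scanned columns, and output table names are handed to the two
formats varies by version, so check the TableInputFormat and
TableOutputFormat javadoc for yours.

    import org.apache.hadoop.hbase.mapred.TableInputFormat;
    import org.apache.hadoop.hbase.mapred.TableOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class BuildSummaryTable {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(BuildSummaryTable.class);
        job.setJobName("master-to-summary");

        // Step 2: TableInputFormat splits a scan of the master table so
        // that one map task is created per region.
        job.setInputFormat(TableInputFormat.class);

        // Step 3: your mapper emits (new-key, value) pairs keyed the way
        // the summary table is organized; your reducer folds each group
        // into count, max, min, and whatever else the request needs.
        // MyMap and MyReduce are hypothetical classes you supply.
        job.setMapperClass(MyMap.class);
        job.setReducerClass(MyReduce.class);

        // Step 4: TableOutputFormat turns each reduce output into a row
        // of the new summary table.  You will also need
        // setOutputKeyClass/setOutputValueClass to match what MyReduce
        // emits; the expected types depend on your version.
        job.setOutputFormat(TableOutputFormat.class);

        JobClient.runJob(job);
      }
    }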

-Bryan

On Feb 4, 2008, at 3:32 PM, Charles Kaminski wrote:

> Hi Bryan,
>
> Thanks for the thoughtful response.  Could you take a
> moment to write a few lines at a high level on how you
> would leverage Hadoop and HBase to fit this use case?
>
> I think I’m reading the following in your response:
> 1. Build and maintain the large master table in HBase
> 2. Use TableInputFormat to convert HBase data into a
> raw format for Hadoop on HDFS
> 3. Run Map Reduce in Hadoop
> 4. Use TableOutputFormat to build the new table
>
> Do I have that right?

Re: Evaluating HBase

Posted by Charles Kaminski <fr...@yahoo.com>.
Hi Bryan,

Thanks for the thoughtful response.  Could you take a
moment to write a few lines at a high level on how you
would leverage Hadoop and HBase to fit this use case?

I think I’m reading the following in your response:
1. Build and maintain the large master table in HBase
2. Use TableInputFormat to convert HBase data into a
raw format for Hadoop on HDFS
3. Run Map Reduce in Hadoop
4. Use TableOutputFormat to build the new table

Do I have that right?


--- Bryan Duxbury <br...@rapleaf.com> wrote:

> This seems like a good fit for HBase in general.
> You're right, it's
> an application for MapReduce-style processing.
> HBase doesn't need  
> MapReduce in the sense that HBase is not built
> dependent upon it.  
> However, we are interested in making HBase play well
> with MapReduce,  
> and have several handy classes (TableInputFormat,
> TableOutputFormat)  
> in HBase for doing that with Hadoop's MapReduce.
> 
> In the current version of HBase, you're correct,
> there is no way to  
> guarantee that you are mapping over local data. Data
> locality is  
> something that we are very interested in, but
> haven't really had the  
> time to pursue yet. We're more concerned about the
> general  
> reliability and scalability of HBase. We also need
> to have HDFS, the  
> underlying distributed file system, support
> locality-awareness, which  
> is something it hasn't gotten completely down yet.
> 
> I think you should probably give HBase a shot and
> see how it goes.  
> We're very, very interested in seeing how HBase
> performs under  
> massive loads and datasets.
> 
> -Bryan


Re: Evaluating HBase

Posted by Bryan Duxbury <br...@rapleaf.com>.
This seems like a good fit for HBase in general. You're right, it's
an application for MapReduce-style processing. HBase doesn't need
MapReduce in the sense that HBase is not built dependent upon it.  
However, we are interested in making HBase play well with MapReduce,  
and have several handy classes (TableInputFormat, TableOutputFormat)  
in HBase for doing that with Hadoop's MapReduce.

In the current version of HBase, you're correct, there is no way to  
guarantee that you are mapping over local data. Data locality is  
something that we are very interested in, but haven't really had the  
time to pursue yet. We're more concerned about the general  
reliability and scalability of HBase. We also need to have HDFS, the  
underlying distributed file system, support locality-awareness, which  
is something it hasn't gotten completely down yet.

I think you should probably give HBase a shot and see how it goes.  
We're very, very interested in seeing how HBase performs under  
massive loads and datasets.

-Bryan

On Feb 4, 2008, at 2:44 PM, Charles Kaminski wrote:

> Hi All,
>
> I am evaluating HBase and I am not sure if our
> use-case fits naturally with HBase’s capabilities.  I
> would appreciate any help.
>
> We would like to store a large number (billions) of
> rows in HBase using a key field to access the values.
> We will then need to continually add, update, and
> delete rows.  This is our master table.  What I
> describe here naturally fits into what HBase is
> designed to do.
>
> It’s this next part that I’m having trouble finding
> documentation for.
>
> We would like to use HBase’s parallel processing
> capabilities to periodically spawn off other temporary
> tables when requested.  We would like to take the
> first table (the master table), go through the key and
> field values in its rows.  From this, we would like to
> create a second table organized differently from the
> master table.  We would also need to include count,
> max, min, and other things specific to the particular
> request.
>
> This seems like textbook map-reduce functionality, but
> I don’t see too much in HBase referencing this kind of
> setup.  Also, there is a reference in HBase’s 10-minute
> startup guide that states “[HBase doesn’t] need
> mapreduce”.
>
> I suppose we could use HBase as an input and output to
> Hadoop's map reduce functionality.  If we did that,
> what would guarantee that we were mapping to local
> data?
>
> Any help would be greatly appreciated.  If you have a
> reference to a previous discussion or document I could
> read, that would be appreciated as well.
>
> -FA