Posted to dev@hbase.apache.org by Wilm Schumacher <wi...@gmail.com> on 2015/03/09 03:55:39 UTC

feature request and question: "BigPut" and "BigGet"

Hi,

I have an idea for a feature in hbase which directly derives from the
idea of the MOB feature. As Jonathan Hsieh pointed out, the only thing
that limits the feature to MOBs instead of LOBs is the memory
allocation on client and server side. However, a "LOB feature" would
be very handy for me and, I think, for some other users, too. Furthermore,
the problem of fetching small files quickly could be solved.

The natural solution would be a "BigPut" and a "BigGet" class that
address this problem and are capable of dealing with large amounts
of data without using too much memory. My plan for now is to create
classes that offer e.g.
BigPut BigPut.add( byte[] , byte[] , inputstream )
and
outputstream BigResult.value( byte[] , byte[] )
(in addition to the normal byte[] to byte[] member functions)

and pass the input streams through the AsyncProcess class to the RPC or,
in reverse, the output stream to the BigResult class. With this plan the
client and server would have to spawn some threads to deal with
multiple streams[1].
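
A minimal sketch of what the proposed streaming signature could look like on the client side. Everything here is an assumption made for illustration: `BigPutSketch` and `writeTo` are hypothetical names, not part of the HBase client, and a plain `OutputStream` stands in for the RPC layer. The only point it shows is that a value handed over as an `InputStream` can be forwarded in fixed-size chunks, so memory use is bounded by the buffer size rather than by the size of the value.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: neither this class nor its methods exist in HBase.
final class BigPutSketch {
    private final byte[] row;
    private byte[] family;
    private byte[] qualifier;
    private InputStream value;

    BigPutSketch(byte[] row) {
        this.row = row;
    }

    // Mirrors the proposed "BigPut.add( byte[] , byte[] , inputstream )".
    BigPutSketch add(byte[] family, byte[] qualifier, InputStream value) {
        this.family = family;
        this.qualifier = qualifier;
        this.value = value;
        return this;
    }

    // Stand-in for the RPC layer: forwards the value chunk by chunk, so at
    // most bufferSize bytes of the value are held in memory at any time.
    long writeTo(OutputStream sink, int bufferSize) {
        if (value == null) {
            throw new IllegalStateException("add(...) must be called first");
        }
        byte[] buffer = new byte[bufferSize];
        long total = 0;
        try {
            int n;
            while ((n = value.read(buffer)) != -1) {
                sink.write(buffer, 0, n);
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] payload = new byte[1_000_000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long written = new BigPutSketch("row1".getBytes(StandardCharsets.UTF_8))
            .add("cf".getBytes(StandardCharsets.UTF_8),
                 "q".getBytes(StandardCharsets.UTF_8),
                 new ByteArrayInputStream(payload))
            .writeTo(sink, 64 * 1024);
        System.out.println(written); // prints 1000000
    }
}
```

The sketch deliberately ignores the threading and server-side questions raised in this message; it only covers the client-side copy loop.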

So far I have dug into the hbase-client (2.0.0) sources and I think that my
plan would be quite invasive to the existing code ... but it is doable.
However, given the very open development model of hbase features, I
think it could be addressed.

But I'm veeeery new to hbase development and have just started to read the
source. Before I dig too deep into the problem I wanted to ask here if
there is any show stopper I'm missing so far.
To make a list of questions for that feature:
* As this plan probably won't break the thread model of the
hbase-client, is there any problem on the (region) server side? Or is
there any blocking/race condition problem elsewhere that I'm missing?
* Is it a bad plan to pump several 100s of MB through one RPC in a
separate thread? If yes ... why?
* Are there any other fundamental problems I'm missing which make
this a horrible plan?
* Is there already some dev ongoing? I didn't find anything on jira.
But that doesn't mean anything :/
* Does anyone have a better name than "BigPut" :D?

And at last:
* Is it a better plan to create a separate "MOB/LOB service"?[2]

Best wishes

Wilm

[1] or one could limit the number of streams to one. That way the
threading problem would be much simpler to handle, as only one
"RPC" would be necessary.

[2] on one hand it is easier to bear LOBs in mind if you create a
service e.g. with a rest interface (multipart data etc), on the other
hand you have to reinvent the wheel (compaction etc.)

Re: feature request and question: "BigPut" and "BigGet"

Posted by Michael Segel <ms...@hotmail.com>.
All, 

Before anyone starts to toss out silly rule of thumb numbers, I really think you need to take a step back, take a few deep breaths and relax. 

What you’re overlooking is how the data is going to be used. 

This is actually a good design problem I may include in my upcoming talk… 
Does anyone in this thread have an issue if I reference it? 





Re: feature request and question: "BigPut" and "BigGet"

Posted by Wilm Schumacher <wi...@gmail.com>.
Am 09.03.2015 um 05:36 schrieb Wilm Schumacher:
> ... I'm around 2.5 TB raw "LOB data", which isn't that large. Or 100
> TB for a 10MB threshold and a medium size of 20 MB for LOBs ... or 200
> TB for 10 MB threshold and doubled namenode RAM etc. etc.
Damn. Error in calculation. Sry .... every result times 10. Makes it
even more academic.

Best wishes,

Wilm


Re: feature request and question: "BigPut" and "BigGet"

Posted by Wilm Schumacher <wi...@gmail.com>.
Am 09.03.2015 um 05:01 schrieb lars hofhansl:
> Thanks for looking into this Wilm.
> I would honestly suggest just writing larger LOBs directly into HDFS and just storing the location in HBase.
> You can do that with a relatively simple protocol, with reasonable safety:
> 1. Write the metadata row into HBase
> 2. Write the LOB into HDFS
> 3. When the LOB was written, update the metadata row with the LOB's location.
> 4. Report success back to the client
that would be a client side approach, which of course would work, but
which has some downsides (e.g. being out of sync as you pointed out). On
the other hand ... no large change of core hbase code ;).

But of course this way the small files problem (which I'm facing) is only
half solved. If I use your 1MB threshold and let's say a
mean size of 5 MB of one "LOB" and the limitation to ~5M "larger" files
(due to namenode) ... I'm around 2.5 TB raw "LOB data", which isn't that
large.

Or 100 TB for a 10MB threshold and a medium size of 20 MB for LOBs ...
or 200 TB for 10 MB threshold and doubled namenode RAM etc. etc.
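
A quick check of the arithmetic behind these estimates, as a sketch under stated assumptions: "~5M files" is read as exactly 5,000,000 files, "doubled namenode RAM" as twice that file count, and sizes are in decimal units (1 TB = 1,000,000 MB). The class and method names are made up for this example. Under these assumptions the first case comes out at 25 TB rather than 2.5 TB, which matches the factor-of-ten correction posted in the follow-up message; the other two cases come out at 100 TB and 200 TB.

```java
// Illustrative only: the file limit is the rough figure quoted in the
// thread, not a measured namenode limit.
final class LobCapacitySketch {
    // Raw capacity in decimal terabytes for a file count and a mean size in MB.
    static double capacityTb(long files, long meanSizeMb) {
        return files * meanSizeMb / 1_000_000.0; // MB -> TB
    }

    public static void main(String[] args) {
        System.out.println(capacityTb(5_000_000L, 5));   // prints 25.0
        System.out.println(capacityTb(5_000_000L, 20));  // prints 100.0
        System.out.println(capacityTb(10_000_000L, 20)); // prints 200.0
    }
}
```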

This way I can catch the really small stuff. But I'm still limited when it
comes to "slightly larger MOBs" or "small LOBs".

However, this is still way beyond my current application problems, thus
the problem is more of an academic nature :/.

> If the LOB is small... maybe < 1mb, you'd just write it into HBase as a value (preferably into a different column family)
>
> If the process fails at #2 or #3 you'd have an orphaned file in HDFS, but those are easy to find (metadata rows for which the location is unset, and older than - say - a few days)
I would run a MapReduce job over the file names and look each one up in
HBase => if not found => delete. But yeah, somehow on the client side.
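
The cleanup idea can be sketched roughly as follows. This is an illustration only: a `Map` of path to modification time stands in for an HDFS listing, a `Set` of locations stands in for lookups against the HBase metadata rows, and `OrphanSweepSketch` is a hypothetical name. The minimum-age check is borrowed from the "older than a few days" suggestion in the quoted text, so that a file whose metadata row has not been updated yet is not deleted while the upload is still in flight.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper; plain collections replace the HDFS and HBase calls.
final class OrphanSweepSketch {
    // files:      path -> last modification time in millis (HDFS listing stand-in)
    // referenced: locations recorded in metadata rows (HBase lookup stand-in)
    // Returns the paths that are old enough and referenced by no metadata row.
    static List<String> findOrphans(Map<String, Long> files,
                                    Set<String> referenced,
                                    long nowMillis,
                                    long minAgeMillis) {
        List<String> orphans = new ArrayList<>();
        for (Map.Entry<String, Long> file : files.entrySet()) {
            boolean oldEnough = nowMillis - file.getValue() >= minAgeMillis;
            if (oldEnough && !referenced.contains(file.getKey())) {
                orphans.add(file.getKey());
            }
        }
        return orphans;
    }
}
```

In a MapReduce version each mapper would take a slice of the file listing and perform the lookup itself; the decision per file stays the same.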

> Your BigPut and BigGet could just be an API around this process.
yupp.

As two independent developers gave the same answer, I'll drop the idea
and continue with the client-side approach.

Thanks for the fast reply,

Wilm

Re: feature request and question: "BigPut" and "BigGet"

Posted by lars hofhansl <la...@apache.org>.
Thanks for looking into this Wilm.
I would honestly suggest just writing larger LOBs directly into HDFS and just storing the location in HBase.
You can do that with a relatively simple protocol, with reasonable safety:
1. Write the metadata row into HBase
2. Write the LOB into HDFS
3. When the LOB was written, update the metadata row with the LOB's location.
4. Report success back to the client

If the LOB is small... maybe < 1mb, you'd just write it into HBase as a value (preferably into a different column family)

If the process fails at #2 or #3 you'd have an orphaned file in HDFS, but those are easy to find (metadata rows for which the location is unset, and older than - say - a few days)

Your BigPut and BigGet could just be an API around this process.
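
The steps above can be sketched as a thin client-side wrapper. This is an illustration under loud assumptions, not real HBase or HDFS code: two in-memory maps stand in for the HBase table and the HDFS directory, `LobStoreSketch` and the `/lobs/` path scheme are made up, and values are passed as byte arrays only to keep the example short (a real version would stream them).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical wrapper; the maps are stand-ins, NOT the real client APIs.
final class LobStoreSketch {
    static final int INLINE_THRESHOLD = 1 << 20; // the "maybe < 1mb" cut-off

    static final class Meta {
        final long createdMillis; // lets a sweeper find old, unfinished rows
        String location;          // unset until the LOB has been written
        byte[] inlineValue;       // only used for small values

        Meta(long createdMillis) {
            this.createdMillis = createdMillis;
        }
    }

    final Map<String, Meta> metaTable = new HashMap<>();    // HBase stand-in
    final Map<String, byte[]> fileSystem = new HashMap<>(); // HDFS stand-in

    boolean put(String rowKey, byte[] value, long nowMillis) {
        Meta meta = new Meta(nowMillis);
        if (value.length < INLINE_THRESHOLD) {
            meta.inlineValue = value;      // small value: store it directly
            metaTable.put(rowKey, meta);
            return true;
        }
        metaTable.put(rowKey, meta);       // 1. write the metadata row
        String path = "/lobs/" + rowKey;   // hypothetical path scheme
        fileSystem.put(path, value);       // 2. write the LOB
        meta.location = path;              // 3. record the LOB's location
        return true;                       // 4. report success
    }

    byte[] get(String rowKey) {
        Meta meta = metaTable.get(rowKey);
        if (meta == null) {
            return null;
        }
        if (meta.inlineValue != null) {
            return meta.inlineValue;
        }
        // A row without a location is an unfinished or failed upload.
        return meta.location == null ? null : fileSystem.get(meta.location);
    }
}
```

If the process dies between steps 2 and 3, the metadata row is left with an unset location, which is exactly the condition a cleanup job can look for.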

-- Lars


Re: feature request and question: "BigPut" and "BigGet"

Posted by 张铎 <pa...@gmail.com>.
If LOB means data larger than 10MB or even 100MB, why not just use a
FileSystem instead of HBase?
A FileSystem already has a stream interface...
