Posted to user@hbase.apache.org by Rohit Kelkar <ro...@gmail.com> on 2012/01/27 09:42:52 UTC

advice needed on storing large objects on hdfs

Hi,
I am using HBase to store Java objects. The objects implement the
Writable interface. The size of objects to be stored in each row
ranges from a few KB to ~50 MB. The strategy that I am planning to use
is:
if object size < 5 MB
    store it in HBase
else
    store it on HDFS and insert its HDFS location in HBase

While storing the objects I am using the
WritableUtils.toByteArray(myObject) method. Can I use
WritableUtils.toByteArray(myObject).length to determine whether the object
should go in HBase or HDFS? Is this an acceptable strategy? Is the 5
MB limit a safe enough threshold?

- Rohit Kelkar
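
For illustration, a minimal sketch of the size-based routing described above,
written against the old (0.9x-era) HBase client API. The table name "objects",
column family "d", qualifiers "obj"/"loc" and the HDFS directory "/objects" are
placeholder assumptions, not anything specified in the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;

public class ObjectStore {
    private static final int THRESHOLD = 5 * 1024 * 1024; // 5 MB

    // Serialize the Writable, then route it by size: small objects go
    // straight into HBase, large ones to HDFS with only the path in HBase.
    public void store(byte[] rowKey, Writable obj, Configuration conf) throws Exception {
        byte[] serialized = WritableUtils.toByteArray(obj); // whole object held in memory
        HTable table = new HTable(HBaseConfiguration.create(conf), "objects");
        Put put = new Put(rowKey);
        if (serialized.length < THRESHOLD) {
            put.add(Bytes.toBytes("d"), Bytes.toBytes("obj"), serialized);
        } else {
            // Placeholder HDFS layout: one file per row key.
            Path path = new Path("/objects/" + Bytes.toString(rowKey));
            FSDataOutputStream out = FileSystem.get(conf).create(path);
            out.write(serialized);
            out.close();
            put.add(Bytes.toBytes("d"), Bytes.toBytes("loc"), Bytes.toBytes(path.toString()));
        }
        table.put(put);
        table.close();
    }
}

Note that, as Ioan points out elsewhere in the thread, WritableUtils.toByteArray()
materializes the whole serialized object in memory, so for the ~50 MB objects a
streaming write to HDFS would be preferable to the byte[] copy shown here.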

Re: advice needed on storing large objects on hdfs

Posted by Rohit Kelkar <ro...@gmail.com>.
Ioan, sorry for messing up your name. Your strategy sounds
interesting. I will try that out and post the results/problems if and
when ...

- Rohit Kelkar

On Mon, Jan 30, 2012 at 1:41 PM, Ioan Eugen Stan <st...@gmail.com> wrote:
> On 30.01.2012 09:53, Rohit Kelkar wrote:
>
>> Hi Stack,
>> My problem is that I have a large number of smaller objects and a few
>> larger objects. My strategy is to store smaller objects (size < 5 MB)
>> in HBase and larger objects (size > 5 MB) on HDFS. And I also want to
>> run MapReduce tasks on those objects. Loan suggested that I should put
>> all objects in a MapFile/SequenceFile on HDFS and insert into HBase
>> the reference of the object stored in the file. Now if I run a
>> MapReduce task, my mapper would be run local to the object
>> references and not to the actual DFS block where the object resides.
>>
>> - Rohit Kelkar
>
>
> Hi Rohit,
>
> First, my name is Ioan (with an i); second, it's a tricky question. If you run
> MapReduce with input from HBase you will have data locality for HBase data
> and not for the data in your SequenceFiles. You could get data locality
> for those if you perform a pre-setup job that scans HBase and builds a list
> of files to process and then runs another MR job on Hadoop targeting the
> SequenceFiles. I think you can find ways to optimize the pre-process step to
> be fast.
>
> The setup that I described is more suitable for situations where you need to
> stream data that is larger than HBase is recommended to handle, like
> mailboxes with large attachments. I'm planning to implement it soon in
> Apache James's HBase mailbox implementation to deal with large inboxes.
>
> Cheers,
>
>
> --
> Ioan Eugen Stan
> http://ieugen.blogspot.com

Re: advice needed on storing large objects on hdfs

Posted by Ioan Eugen Stan <st...@gmail.com>.
On 30.01.2012 09:53, Rohit Kelkar wrote:
> Hi Stack,
> My problem is that I have a large number of smaller objects and a few
> larger objects. My strategy is to store smaller objects (size < 5 MB)
> in HBase and larger objects (size > 5 MB) on HDFS. And I also want to
> run MapReduce tasks on those objects. Loan suggested that I should put
> all objects in a MapFile/SequenceFile on HDFS and insert into HBase
> the reference of the object stored in the file. Now if I run a
> MapReduce task, my mapper would be run local to the object
> references and not to the actual DFS block where the object resides.
>
> - Rohit Kelkar

Hi Rohit,

First, my name is Ioan (with an i); second, it's a tricky question. If you
run MapReduce with input from HBase you will have data locality for
HBase data and not for the data in your SequenceFiles. You could get
data locality for those if you perform a pre-setup job that scans HBase
and builds a list of files to process and then runs another MR job on
Hadoop targeting the SequenceFiles. I think you can find ways to optimize
the pre-process step to be fast.

The setup that I described is more suitable for situations where you
need to stream data that is larger than HBase is recommended to
handle, like mailboxes with large attachments. I'm planning to implement
it soon in Apache James's HBase mailbox implementation to deal with
large inboxes.

Cheers,

-- 
Ioan Eugen Stan
http://ieugen.blogspot.com
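
As a rough sketch of the pre-setup step described above, assuming the large
objects were stored with a SequenceFile reference in an HBase column (the table
name "objects" and column "d:loc" are placeholders reused from the sketch near
the top of this page, and the old HBase client API is used):

import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

public class PreSetup {
    public static Job buildJob(Configuration conf) throws Exception {
        // Step 1: scan HBase and collect the distinct SequenceFile paths.
        Set<String> files = new HashSet<String>();
        HTable table = new HTable(HBaseConfiguration.create(conf), "objects");
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("d"), Bytes.toBytes("loc"));
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            byte[] loc = r.getValue(Bytes.toBytes("d"), Bytes.toBytes("loc"));
            if (loc != null) {
                files.add(Bytes.toString(loc));
            }
        }
        scanner.close();
        table.close();

        // Step 2: a second MR job directly over those SequenceFiles, so the
        // splits follow the HDFS blocks of the files rather than HBase regions.
        Job job = new Job(conf, "process-large-objects");
        job.setInputFormatClass(SequenceFileInputFormat.class);
        for (String f : files) {
            FileInputFormat.addInputPath(job, new Path(f));
        }
        // job.setMapperClass(...) etc. as needed for the actual processing.
        return job;
    }
}

Because the second job's splits come from the SequenceFiles themselves, its map
tasks are scheduled near the HDFS blocks of those files, which is the locality
Rohit is asking about.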

Re: advice needed on storing large objects on hdfs

Posted by Rohit Kelkar <ro...@gmail.com>.
Hi Stack,
My problem is that I have a large number of smaller objects and a few
larger objects. My strategy is to store smaller objects (size < 5 MB)
in HBase and larger objects (size > 5 MB) on HDFS. And I also want to
run MapReduce tasks on those objects. Loan suggested that I should put
all objects in a MapFile/SequenceFile on HDFS and insert into HBase
the reference of the object stored in the file. Now if I run a
MapReduce task, my mapper would be run local to the object
references and not to the actual DFS block where the object resides.

- Rohit Kelkar

On Mon, Jan 30, 2012 at 11:12 AM, Stack <st...@duboce.net> wrote:
> On Sun, Jan 29, 2012 at 8:36 PM, Rohit Kelkar <ro...@gmail.com> wrote:
>> Hi Loan, this seems interesting. But in your approach I have a follow-up
>> question - would I be able to take advantage of data locality
>> while running MapReduce tasks? My understanding is that the locality
>> would be with respect to the references to those objects and not the
>> actual objects themselves.
>>
>
> If you are MapReducing against HBase, locality should be respected:
> i.e. the <5 MB objects should be on the local regionserver.  The bigger
> stuff may be local -- it would depend on how it was written.
>
> Not sure what you mean above when you talk of references vs actual objects.
>
> Yours,
> St.Ack

Re: advice needed on storing large objects on hdfs

Posted by Stack <st...@duboce.net>.
On Sun, Jan 29, 2012 at 8:36 PM, Rohit Kelkar <ro...@gmail.com> wrote:
> Hi Loan, this seems interesting. But in your approach I have a follow-up
> question - would I be able to take advantage of data locality
> while running MapReduce tasks? My understanding is that the locality
> would be with respect to the references to those objects and not the
> actual objects themselves.
>

If you are MapReducing against HBase, locality should be respected:
i.e. the <5 MB objects should be on the local regionserver.  The bigger
stuff may be local -- it would depend on how it was written.

Not sure what you mean above when you talk of references vs actual objects.

Yours,
St.Ack
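
For reference, a minimal sketch of a job that MapReduces against HBase in the
way described above, using TableMapReduceUtil; the table name "objects" and the
column layout mentioned in the comments are placeholders carried over from the
earlier sketches, not anything from this thread:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HBaseObjectJob {
    static class ObjectMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            // Small objects arrive here straight from the local regionserver;
            // for large rows, fetch the bytes from the HDFS path stored in "d:loc".
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "scan-objects");
        job.setJarByClass(HBaseObjectJob.class);
        TableMapReduceUtil.initTableMapperJob("objects", new Scan(),
                ObjectMapper.class, NullWritable.class, NullWritable.class, job);
        job.setOutputFormatClass(NullOutputFormat.class);
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

TableInputFormat hands out one split per region and reports the hosting
regionserver as the split location, so the small in-HBase objects get the
locality described above, while any HDFS-resident blob fetched from inside the
mapper may or may not be local.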

Re: advice needed on storing large objects on hdfs

Posted by Rohit Kelkar <ro...@gmail.com>.
Hi Loan, this seems interesting. But in your approach I have a follow-up
question - would I be able to take advantage of data locality
while running MapReduce tasks? My understanding is that the locality
would be with respect to the references to those objects and not the
actual objects themselves.

- Rohit Kelkar

On Fri, Jan 27, 2012 at 4:21 PM, Ioan Eugen Stan <st...@gmail.com> wrote:
> Hello Rohit,
>
> I would try to write most objects in a Hadoop SequenceFile or a MapFile and
> store the index/byte offset in HBase.
>
> When reading: open the file, seek() to the position and start reading the
> key:value. I don't think that using toByteArray() is good because, I think,
> you are creating a copy of the object in memory. If it's big you will end up
> with two instances of it. Try to stream the object directly to disk.
>
> I don't know if 5 MB is good or not; I hope someone can shed some light.
>
> If the objects are changing: append to the SequenceFile and update the
> reference in HBase. From time to time run a MR job that cleans the file.
>
> You can use ZooKeeper to coordinate writing to many SequenceFiles.
>
> If you go this way, please post your results.
>
> Cheers,
>
> On 27.01.2012 10:42, Rohit Kelkar wrote:
>
>> Hi,
>> I am using HBase to store Java objects. The objects implement the
>> Writable interface. The size of objects to be stored in each row
>> ranges from a few KB to ~50 MB. The strategy that I am planning to use
>> is:
>> if object size < 5 MB
>>     store it in HBase
>> else
>>     store it on HDFS and insert its HDFS location in HBase
>>
>> While storing the objects I am using the
>> WritableUtils.toByteArray(myObject) method. Can I use
>> WritableUtils.toByteArray(myObject).length to determine whether the object
>> should go in HBase or HDFS? Is this an acceptable strategy? Is the 5
>> MB limit a safe enough threshold?
>>
>> - Rohit Kelkar
>
>
>
> --
> Ioan Eugen Stan
> http://ieugen.blogspot.com

Re: advice needed on storing large objects on hdfs

Posted by Ioan Eugen Stan <st...@gmail.com>.
Hello Rohit,

I would try to write most objects in a Hadoop SequenceFile or a MapFile
and store the index/byte offset in HBase.

When reading: open the file, seek() to the position and start reading the
key:value. I don't think that using toByteArray() is good because, I
think, you are creating a copy of the object in memory. If it's big you
will end up with two instances of it. Try to stream the object
directly to disk.

I don't know if 5 MB is good or not; I hope someone can shed some light.

If the objects are changing: append to the SequenceFile and update the 
reference in HBase. From time to time run a MR job that cleans the file.

You can use ZooKeeper to coordinate writing to many SequenceFiles.

If you go this way, please post your results.

Cheers,

On 27.01.2012 10:42, Rohit Kelkar wrote:
> Hi,
> I am using HBase to store Java objects. The objects implement the
> Writable interface. The size of objects to be stored in each row
> ranges from a few KB to ~50 MB. The strategy that I am planning to use
> is:
> if object size < 5 MB
>     store it in HBase
> else
>     store it on HDFS and insert its HDFS location in HBase
>
> While storing the objects I am using the
> WritableUtils.toByteArray(myObject) method. Can I use
> WritableUtils.toByteArray(myObject).length to determine whether the object
> should go in HBase or HDFS? Is this an acceptable strategy? Is the 5
> MB limit a safe enough threshold?
>
> - Rohit Kelkar


-- 
Ioan Eugen Stan
http://ieugen.blogspot.com
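
A rough sketch of the SequenceFile-plus-offset idea above, assuming an
uncompressed SequenceFile so that SequenceFile.Writer.getLength() returns the
byte position at which the next record will start, and that a single process
owns the open writer. The table and column names and the BytesWritable payload
are placeholders, and the old HBase client API is used:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileObjectStore {

    private final HTable table;

    public SequenceFileObjectStore(Configuration conf) throws IOException {
        this.table = new HTable(HBaseConfiguration.create(conf), "objects");
    }

    // Open the backing SequenceFile; callers keep this writer open across
    // many write() calls and close it when done.
    public static SequenceFile.Writer openWriter(Configuration conf, Path seqFile)
            throws IOException {
        return SequenceFile.createWriter(FileSystem.get(conf), conf, seqFile,
                Text.class, BytesWritable.class);
    }

    // Append one record and store its file path and start offset in HBase.
    public void write(SequenceFile.Writer writer, Path seqFile, String rowKey,
                      BytesWritable value) throws IOException {
        long offset = writer.getLength();        // position where this record begins
        writer.append(new Text(rowKey), value);  // the Writable is streamed out

        Put put = new Put(Bytes.toBytes(rowKey));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("path"), Bytes.toBytes(seqFile.toString()));
        put.add(Bytes.toBytes("d"), Bytes.toBytes("offset"), Bytes.toBytes(offset));
        table.put(put);
    }

    // Look up the reference in HBase, then seek straight to the record.
    public BytesWritable read(Configuration conf, String rowKey) throws IOException {
        Result r = table.get(new Get(Bytes.toBytes(rowKey)));
        Path path = new Path(Bytes.toString(r.getValue(Bytes.toBytes("d"), Bytes.toBytes("path"))));
        long offset = Bytes.toLong(r.getValue(Bytes.toBytes("d"), Bytes.toBytes("offset")));

        SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), path, conf);
        reader.seek(offset);
        Text key = new Text();
        BytesWritable value = new BytesWritable();
        reader.next(key, value);
        reader.close();
        return value;
    }
}

In practice the writer would stay open across many appends rather than being
created per object, and, as the message above says, concurrent writers would
need coordination (e.g. via ZooKeeper) plus a periodic MR job to compact the
file and drop stale records.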