Posted to dev@hama.apache.org by Samuel Guo <gu...@gmail.com> on 2008/10/01 12:11:34 UTC

Re: bulk load in hbase

On Mon, Sep 29, 2008 at 7:36 PM, Edward J. Yoon <ed...@apache.org> wrote:

> > But still, if the matrix is huge (many rows, many columns), the loading
> > will cause a lot of matrix-table split actions. Is that right?
>
> Yes, but
>
> >> finally, we can split the matrix's table in hbase first and let
> >> matrix-loading parallely without splitting again.
>
> I don't understand exactly. Do you mean creating tablets directly by
> pre-splitting and assigning them to region servers?
>
> Then, this is a role of HBase. The merge/split is issued after
> compaction. I guess it will be the same as the HBase compaction
> mechanism.


Yes. It is the role of HBase.
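To make the pre-split idea concrete, here is a minimal sketch of creating
the matrix table with its split keys supplied at creation time. It assumes
a later HBase client API (HBaseAdmin.createTable taking split keys, which
2008-era HBase did not have); the table name, column family, row-key
scheme, and region count are all made up for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitMatrixTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("matrix"));
      desc.addFamily(new HColumnDescriptor("column"));

      long estimatedRows = 10000000L; // N(R) = FS / RS, from the estimate
      int regions = 32;               // hypothetical target region count

      // One split key between each pair of neighbouring regions, using
      // zero-padded row indices as row keys so they sort numerically.
      byte[][] splitKeys = new byte[regions - 1][];
      for (int i = 1; i < regions; i++) {
        splitKeys[i - 1] =
            Bytes.toBytes(String.format("%010d", i * (estimatedRows / regions)));
      }

      // The table is created with `regions` regions up front, so the
      // bulk load does not trigger splits while it runs.
      admin.createTable(desc, splitKeys);
    } finally {
      admin.close();
    }
  }
}

With the split keys supplied up front, the table starts with all its
regions already assigned to region servers, so the initial load never has
to wait on splits.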

>
>
> /Edward
>
> On Mon, Sep 29, 2008 at 7:21 PM, Samuel Guo <gu...@gmail.com> wrote:
> > On Mon, Sep 29, 2008 at 12:43 PM, Edward J. Yoon
> > <edwardyoon@apache.org> wrote:
> >
> >> The table is nothing more and nothing less than a matrix. So, we can
> >> think about bulk load such as
> >> http://wiki.apache.org/hadoop/Hbase/MapReduce
> >
> >
> > Yes. MapReduce should be used to load a matrix.
> > But still, if the matrix is huge (many rows, many columns), the loading
> > will cause a lot of matrix-table split actions. Is that right?
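For reference, a bulk load along the lines of that wiki page could be a
map-only job that turns each line of the matrix file into a Put. This is
only a sketch against a later HBase client API; the family name "column",
the zero-padded row keys, and using the line's byte offset as a stand-in
row index are assumptions.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only loader: each input line "d1 d2 ... dm" becomes one Put.
// The byte offset of the line stands in for a real row index here.
public class MatrixLoadMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final byte[] FAMILY = Bytes.toBytes("column");

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String[] cells = line.toString().trim().split("\\s+");
    // Zero-padded row key so rows sort numerically and line up with
    // split keys computed from the row estimate.
    byte[] row = Bytes.toBytes(String.format("%010d", offset.get()));
    Put put = new Put(row);
    for (int j = 0; j < cells.length; j++) {
      // Qualifier = column index, value = the cell as a double.
      put.addColumn(FAMILY, Bytes.toBytes(j),
          Bytes.toBytes(Double.parseDouble(cells[j])));
    }
    context.write(new ImmutableBytesWritable(row), put);
  }
}

Wiring it to the table would be the usual
TableMapReduceUtil.initTableReducerJob("matrix", null, job) plus
job.setNumReduceTasks(0), so the Puts go straight to the region servers.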
> >
> >>
> >>
> >> And I think we can provide some regular format to store the matrix,
> >> such as the Hadoop SequenceFile format.
> >
> >
> > It is great!
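As a sketch of what such a regular format could look like with a plain
Hadoop SequenceFile: one record per matrix row, keyed by the row index.
The key/value layout (IntWritable key, space-separated values in a Text)
and the output path are just assumptions for illustration.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Writes one SequenceFile record per matrix row:
// key = row index, value = space-separated cell values.
public class MatrixSequenceFileWriter {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/tmp/matrix.seq"); // hypothetical output path

    double[][] matrix = { { 1, 2, 3 }, { 4, 5, 6 } }; // toy 2x3 matrix

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, IntWritable.class, Text.class);
    try {
      for (int i = 0; i < matrix.length; i++) {
        StringBuilder row = new StringBuilder();
        for (double d : matrix[i]) {
          if (row.length() > 0) {
            row.append(' ');
          }
          row.append(d);
        }
        writer.append(new IntWritable(i), new Text(row.toString()));
      }
    } finally {
      writer.close();
    }
  }
}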
> >
> >
> >>
> >>
> >> Then, file->matrix, matrix->file, matrix operations,..., all done.
> >>
> >> /Edward
> >>
> >> On Fri, Sep 26, 2008 at 11:26 PM, Samuel Guo <gu...@gmail.com>
> >> wrote:
> >> > hi all,
> >> >
> >> > I am considering how to use map/reduce to bulk-load a matrix from a
> >> > file.
> >> >
> >> > We can split the file and let many mappers load parts of the file.
> >> > But lots of region splits will happen while loading if the matrix is
> >> > huge. It may affect the matrix load performance.
> >> >
> >> > I think that a file that stores a matrix may have a regular layout.
> >> > Without compression, it may look as below:
> >> > d11 d12 d13 .................... d1m
> >> > d21 d22 d23 .................... d2m
> >> > .............................................
> >> > dn1 dn2 dn3 .................... dnm
> >> >
> >> > An optimization method would be:
> >> > (1) Read a line from the matrix file to learn its row size. Assume it
> >> > is RS.
> >> > (2) Get the file size from the filesystem's metadata. Assume it is FS.
> >> > (3) Compute the number of rows: N(R) = FS / RS.
> >> > (4) Once we know the number of rows, we can estimate the number of
> >> > regions of the matrix.
> >> > Finally, we can pre-split the matrix's table in HBase first and let
> >> > the matrix load run in parallel without splitting again (see the
> >> > sketch below).
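The estimate itself costs almost nothing: one metadata call plus reading a
single line. A minimal sketch, assuming an uncompressed, single-byte-encoded
file with fixed-width rows as above; the class and method names are
hypothetical.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Estimates N(R) = FS / RS for an uncompressed matrix file with
// fixed-width rows: FS from filesystem metadata, RS from the length
// of the first line.
public class MatrixRowEstimator {
  public static long estimateRows(Configuration conf, Path file)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    long fileSize = fs.getFileStatus(file).getLen(); // FS

    BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(file)));
    long rowSize;
    try {
      // Assumes single-byte characters, so chars == bytes; +1 for '\n'.
      rowSize = reader.readLine().length() + 1; // RS
    } finally {
      reader.close();
    }
    return fileSize / rowSize; // N(R) = FS / RS
  }
}

Dividing N(R) by the desired rows per region then yields the split keys to
hand to the table creation sketched earlier in the thread.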
> >> >
> >> > Certainly, no one will store a matrix as above in a file; some
> >> > compression will be used to store a dense or sparse matrix.
> >> > But even with a compressed matrix file, we can still pay a little to
> >> > estimate the number of regions of the matrix and gain a performance
> >> > improvement in matrix bulk-loading.
> >> >
> >> > Am I right?
> >> >
> >> > regards,
> >> >
> >> > samuel
> >> >
> >>
> >>
> >>
> >> --
> >> Best regards, Edward J. Yoon
> >> edwardyoon@apache.org
> >> http://blog.udanax.org
> >>
> >
>
>
>
> --
> Best regards, Edward J. Yoon
> edwardyoon@apache.org
> http://blog.udanax.org
>