Posted to user@hbase.apache.org by Michael Dagaev <mi...@gmail.com> on 2009/03/25 10:18:21 UTC

Question on write optimization

Hi, all

    Currently, we write all incoming entities using batch update.

Recently we realized that many incoming entities already exist.
So, we could check whether each incoming entity already exists
and write only the "new" entities.

In other words, we would perform more reads and fewer writes.
Does that make sense?

Thank you for your cooperation,
M.
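
For concreteness, the check-then-write approach described above would look
roughly like the sketch below. This is not code from the thread: it assumes a
0.90-era HTable client API (Get/Put and HBaseConfiguration.create()), and the
table name, column family, qualifier, and row keys are placeholders.

import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckThenWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "entities");           // placeholder table name

        // Stand-in for the stream of incoming entity keys.
        List<String> incomingKeys = Arrays.asList("k1", "k2", "k3");

        for (String key : incomingKeys) {
            Get get = new Get(Bytes.toBytes(key));
            if (!table.exists(get)) {                           // one extra read per incoming entity
                Put put = new Put(Bytes.toBytes(key));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("data"), Bytes.toBytes("value"));
                table.put(put);                                 // only the "new" entities are written
            }
        }
        table.close();
    }
}

As Ryan points out in his reply, each such existence check is an extra round
trip to a region server, so this roughly doubles the number of operations per
incoming entity.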

Re: Question on write optimization

Posted by schubert zhang <zs...@gmail.com>.
Hi Ryan,
Yes, the commit buffer is very useful.

Regards "I can import 880m rows in about 90 minutes on a 19 machine
cluster.", could you please tell me how many column families and qualifiers
in each row?

Thank you in advance.
Schubert

On Wed, Mar 25, 2009 at 5:29 PM, Michael Dagaev <mi...@gmail.com> wrote:

> Thanks, Ryan
>
> On Wed, Mar 25, 2009 at 11:25 AM, Ryan Rawson <ry...@gmail.com> wrote:
> > Hey,
> >
> > At the lowest level, hbase is append only, so multiple values just end up
> > taking extra space (not a big deal) and get compacted out eventually.
> >
> > It would seem to me that the cost of the reads would in many cases be higher
> > than the cost of the extra writes that would have to get compacted out later -
> > there must be a tipping point, but doing a read check before every write is
> > pretty brutal.  You are doubling the number of operations you must do.
> >
> > I find that commit buffering is pretty efficient - I can import 880m rows in
> > about 90 minutes on a 19 machine cluster.
> >
> > you can access this in Java by:
> > table.setAutoCommit(false);
> > table.setAutoCommitBuffer(12 * 1024 * 1024); // i think i got this method name wrong.
> >
> > -ryan
> >
> > On Wed, Mar 25, 2009 at 2:18 AM, Michael Dagaev <
> michael.dagaev@gmail.com> wrote:
> >
> >> Hi, all
> >>
> >>    Currently, we write all incoming entities using batch update.
> >>
> >> Recently we realized that many incoming entities already exist.
> >> So, we could check whether each incoming entity already exists
> >> and write only the "new" entities.
> >>
> >> In other words, we would perform more reads and fewer writes.
> >> Does that make sense?
> >>
> >> Thank you for your cooperation,
> >> M.
> >>
> >
>

Re: Question on write optimization

Posted by Michael Dagaev <mi...@gmail.com>.
Thanks, Ryan

On Wed, Mar 25, 2009 at 11:25 AM, Ryan Rawson <ry...@gmail.com> wrote:
> Hey,
>
> At the lowest level, hbase is append only, so multiple values just end up
> taking extra space (not a big deal) and get compacted out eventually.
>
> It would seem to me that the cost of the reads would in many cases be higher
> than the cost of the extra writes that would have to get compacted out later -
> there must be a tipping point, but doing a read check before every write is
> pretty brutal.  You are doubling the number of operations you must do.
>
> I find that commit buffering is pretty efficient - I can import 880m rows in
> about 90 minutes on a 19 machine cluster.
>
> you can access this in Java by:
> table.setAutoCommit(false);
> table.setAutoCommitBuffer(12 * 1024 * 1024); // i think i got this method name wrong.
>
> -ryan
>
On Wed, Mar 25, 2009 at 2:18 AM, Michael Dagaev <mi...@gmail.com> wrote:
>
>> Hi, all
>>
>>    Currently, we write all incoming entities using batch update.
>>
>> Recently we realized that many incoming entities already exist.
>> So, we could check whether each incoming entity already exists
>> and write only the "new" entities.
>>
>> In other words, we would perform more reads and fewer writes.
>> Does that make sense?
>>
>> Thank you for your cooperation,
>> M.
>>
>

Re: Question on write optimization

Posted by Ryan Rawson <ry...@gmail.com>.
Hey,

At the lowest level, hbase is append only, so multiple values just end up
taking extra space (not a big deal) and get compacted out eventually.

It would seem to me that the cost of the reads would in many cases be higher
than the cost of the extra writes that would have to get compacted out later -
there must be a tipping point, but doing a read check before every write is
pretty brutal.  You are doubling the number of operations you must do.

I find that commit buffering is pretty efficient - I can import 880m rows in
about 90 minutes on a 19 machine cluster.

you can access this in Java by:
table.setAutoCommit(false);
table.setAutoCommitBuffer(12 * 1024 * 1024); // i think i got this method name wrong.

-ryan
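
The method names Ryan half-remembers correspond, in the HTable client API, to
setAutoFlush(false) and setWriteBufferSize(long), with flushCommits() sending
whatever is still sitting in the buffer. The sketch below is along those
lines; it is not code from this thread, the table and column names are
placeholders, and the configuration call assumes a 0.90-era client.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedImport {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "entities");         // placeholder table name

        table.setAutoFlush(false);                           // no RPC per individual put
        table.setWriteBufferSize(12 * 1024 * 1024);          // batch puts into a 12 MB client-side buffer

        for (int i = 0; i < 1000000; i++) {
            Put put = new Put(Bytes.toBytes("row-" + i));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("data"), Bytes.toBytes("value-" + i));
            table.put(put);                                  // buffered locally; flushed when the buffer fills
        }

        table.flushCommits();                                // push any edits still in the buffer
        table.close();
    }
}

Note that with auto-flush off, a put is not durable until the buffer is
flushed, so a client crash can lose buffered edits - usually an acceptable
trade-off for a bulk import like the one described above.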

On Wed, Mar 25, 2009 at 2:18 AM, Michael Dagaev <mi...@gmail.com> wrote:

> Hi, all
>
>    Currently, we write all incoming entities using batch update.
>
> Recently we realized that many incoming entities already exist.
> So, we could check whether each incoming entity already exists
> and write only the "new" entities.
>
> In other words, we would perform more reads and fewer writes.
> Does that make sense?
>
> Thank you for your cooperation,
> M.
>