Posted to user@hbase.apache.org by Weishung Chung <we...@gmail.com> on 2011/01/10 17:58:54 UTC

HTable.put(List puts) perform batch insert?

Does HTable.put(List<Put> puts) method perform a batch insert with a single
RPC call? I am going to insert a lot of values into a column family and
would like to increase the write speed.
Thank you.

Re: HTable.put(List puts) perform batch insert?

Posted by tsuna <ts...@gmail.com>.
On Mon, Jan 31, 2011 at 5:19 PM, Ryan Rawson <ry...@gmail.com> wrote:
> It just retrieves the current state of the buffer.  The buffer is
> mutated to remove successful edits as they occur; during an exception,
> the ones that were determined to be successful are also removed.

htable.getWriteBuffer() isn't thread-safe (correct me if I'm wrong) so
be careful with it.

FWIW, asynchbase works differently: you get a callback for each and
every edit.  You can specify two callbacks: one for the success case,
and one callback to handle failures.  Also, it's thread-safe :)
The other cool thing in asynchbase is that it puts an upper bound on
the amount of time data can be buffered in the client.   After that
time has elapsed, the client will flush the writes to HBase.  This
improves liveness in user-facing applications by preventing edits from
sticking too long in the unflushed buffer of some client, while still
allowing for higher throughput through batching of edits.
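
For example, a rough sketch from memory against the asynchbase API
(quorum spec, table, and values are made up; double-check the javadocs):

import com.stumbleupon.async.Callback;
import org.hbase.async.HBaseClient;
import org.hbase.async.PutRequest;

public class AsyncPutExample {
  public static void main(final String[] args) throws Exception {
    final HBaseClient client = new HBaseClient("zkhost:2181");
    // Upper bound on how long an edit may sit in the client-side buffer (ms).
    client.setFlushInterval((short) 1000);

    final PutRequest put = new PutRequest("mytable".getBytes(),
                                          "row1".getBytes(), "f".getBytes(),
                                          "q".getBytes(), "v".getBytes());
    client.put(put)
      .addCallback(new Callback<Object, Object>() {
        public Object call(final Object result) {
          return result;  // this edit made it to HBase
        }
      })
      .addErrback(new Callback<Object, Exception>() {
        public Object call(final Exception e) {
          return e;  // this edit failed; log it or retry it
        }
      });

    client.shutdown().joinUninterruptibly();  // flush and wait before exiting
  }
}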

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

Re: HTable.put(List puts) perform batch insert?

Posted by Ryan Rawson <ry...@gmail.com>.
It just retrieves the current state of the buffer.  The buffer is
mutated to remove successful edits as they occur; during an exception,
the ones that were determined to be successful are also removed.

So if you catch an exception, you can inspect this buffer and know
these puts need to be sent again.  You can also just call flushCommits()
again, to add further retries beyond the base number that the client
already does.

I've done that on large map reduce import jobs, since cluster churn
should eventually settle down, but restarting a 12-hour data import
job sucks.
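
Roughly, that retry loop looks like this (just a sketch against the
0.90-era HTable API; the helper name and retry count are made up):

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

// Assumes htable.setAutoFlush(false), so puts accumulate in the write buffer.
static void flushWithRetries(final HTable htable, final int maxAttempts)
    throws IOException {
  for (int attempt = 1; ; attempt++) {
    try {
      htable.flushCommits();  // success: everything buffered was persisted
      return;
    } catch (IOException e) {
      // Successful edits were removed; only the failed ones remain buffered.
      List<Put> remaining = htable.getWriteBuffer();
      System.err.println(remaining.size() + " edits still unflushed after "
                         + attempt + " attempt(s)");
      if (attempt >= maxAttempts) {
        throw e;  // give up; the buffer holds what never made it
      }
      // Optionally back off here; the next flushCommits() retries what's left.
    }
  }
}
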
-ryan




On Mon, Jan 31, 2011 at 5:13 PM, Jim X <ji...@gmail.com> wrote:
> Does HTable.getWriteBuffer() do a rollback?
>
> Jim

Re: HTable.put(List puts) perform batch insert?

Posted by Sean Bigdatafun <se...@gmail.com>.
On Mon, Jan 31, 2011 at 5:13 PM, Jim X <ji...@gmail.com> wrote:

> Does HTable.getWriteBuffer() do a rollback?
>
>
I guess not --- this only allows you to know what has not been successfully
committed to the server after you catch the exception.

Correct me if I am wrong.

Sean


-- 
--Sean

Re: HTable.put(List puts) perform batch insert?

Posted by Jim X <ji...@gmail.com>.
Does HTable.getWriteBuffer() do a rollback?

Jim

On Mon, Jan 31, 2011 at 8:04 PM, Ryan Rawson <ry...@gmail.com> wrote:
> When you are using the buffer, you also need to flush it:
>
> htable.flushCommits();
>
> If the call succeeds, the edits were persisted.  If at any point you
> get exceptions, the unfinished edits are left in the write buffer and
> htable.getWriteBuffer() gets you them.
>
> -ryan

Re: HTable.put(List puts) perform batch insert?

Posted by Ryan Rawson <ry...@gmail.com>.
When you are using the buffer, you also need to flush it:

htable.flushCommits();

If the call succeeds, the edits were persisted.  If at any point you
get exceptions, the unfinished edits are left in the write buffer and
htable.getWriteBuffer() gets you them.

-ryan

On Mon, Jan 31, 2011 at 10:48 AM, Sean Bigdatafun
<se...@gmail.com> wrote:
> setWriteBufferSize(1024 * 1024 * 10); // 10MB
> setAutoFlush(false);
>
> for (i = 0; i < N; i++) {
>   list.add(putitem[i]);
> }
>
> htable.put(list);
>
> For the above pseudocode (using put(List) to commit updates to HBase), can I
> get a "batch transaction" success notification?
>       * i.e., how can I know all the items have been successfully
> committed? It seems that I can't get such information; everything is
> best-effort. If I knew some commits failed, I could do an application-level
> retry.
>       * setAutoFlush(true) does not seem to give us any more
> reliable operation either.

Re: HTable.put(List puts) perform batch insert?

Posted by Sean Bigdatafun <se...@gmail.com>.
setWriteBufferSize(1024 * 1024 * 10); // 10MB
setAutoFlush(false);

for (i = 0; i < N; i++) {
  list.add(putitem[i]);
}

htable.put(list);


For the above pseudocode (using put(List) to commit updates to HBase), can I
get a "batch transaction" success notification?
       * i.e., how can I know all the items have been successfully
committed? It seems that I can't get such information; everything is
best-effort. If I knew some commits failed, I could do an application-level
retry.
       * setAutoFlush(true) does not seem to give us any more reliable
operation either.








-- 
--Sean

Re: HTable.put(List puts) perform batch insert?

Posted by tsuna <ts...@gmail.com>.
On Fri, Jan 14, 2011 at 4:06 PM, Sean Bigdatafun
<se...@gmail.com> wrote:
> But how can the client understand which k-v belongs to an individual RS?
> Does it need to scan the .META. table? (if so, it's an expensive op). On the
> RegionServer side, is it like processing multiple requests in a batch per
> RPC?

The client has to figure out which region each edit has to go to.  The
client maintains a local cache of the META table, so when you
frequently use the same working set of regions (which is common for
most applications), the lookups are essentially free.

The worst case is a client that does random-writes to all the regions
in a huge table.  In this case, the client will end up discovering the
location of all the regions of that table and keep this in its
in-memory cache.  But regions move around, get split, etc.  This does
cause extra META lookups, but the latency for a META lookup is
typically very small (even though the penalty incurred by the client
compared to cache hits in its local META cache is huge, comparatively
speaking).  Note that right now neither HTable nor asynchbase
pro-actively evict unused entries from the local META cache to save
memory.  I don't think anyone is running HBase at a scale where this
optimization would be useful.

If you have a write-heavy application, you're always going to get
significantly higher throughput when you send your edits in batch to
the server.  The downside to this is that when your client application
dies, you lose all the edits in the un-committed batch.  Unlike
HTable, asynchbase puts an upper bound on the amount of time an edit
is allowed to remain in the client's buffer, which helps limit
data-loss when a client crashes (OpenTSDB sets this to 1s by default,
so when it dies, you know you lost at most 1s worth of datapoints).

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

Re: HTable.put(List puts) perform batch insert?

Posted by Sean Bigdatafun <se...@gmail.com>.
But how can the client understand which k-v belongs to an individual RS?
Does it need to scan the .META. table? (if so, it's an expensive op). On the
RegionServer side, is it like processing multiple requests in a batch per
RPC?


Can you guide us to dive into it a bit more?


Thanks,
Sean



On Mon, Jan 10, 2011 at 9:38 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> HBaseHUT is used to solve the Get+Put problem, so if it's your problem
> as well then do look into it.
>
> To answer your first question, that method will group Puts by region
> server, meaning that it will do anywhere between 1 and n RPCs, where n
> is the number of RSs, and that's done in parallel.
>
> J-D



-- 
--Sean

RE: HTable.put(List puts) perform batch insert?

Posted by Jonathan Gray <jg...@fb.com>.
BatchUpdate is the old, deprecated version of Put.  You are using the best APIs.

> -----Original Message-----
> From: Weishung Chung [mailto:weishung@gmail.com]
> Sent: Monday, January 10, 2011 10:10 AM
> To: user@hbase.apache.org
> Subject: Re: HTable.put(List<Put> puts) perform batch insert?
> 
> Thank you :)
> Could I use org.apache.hadoop.hbase.io.BatchUpdate? Would it be faster
> than the put(List<Put>)?
> Also, would you recommend the use of MapReduce to accomplish the
> same thing?

Re: HTable.put(List puts) perform batch insert?

Posted by Jim X <ji...@gmail.com>.
Which one did you finally use for batch processing, like a JDBC batch?

On Tue, Jan 18, 2011 at 11:31 AM, Weishung Chung <we...@gmail.com> wrote:
> Thank you, I will look into these packages :)

Re: HTable.put(List puts) perform batch insert?

Posted by Weishung Chung <we...@gmail.com>.
Thank you, I will look into these packages :)

On Sun, Jan 16, 2011 at 4:17 AM, Otis Gospodnetic <
otis_gospodnetic@yahoo.com> wrote:

> Hi,
>
> Re HBaseHUT - Alex didn't mention it, but he did a really nice and clear
> writeup
> of it in this post:
>
> http://blog.sematext.com/2010/12/16/deferring-processing-updates-to-increase-hbase-write-performance/
>
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/

Re: HTable.put(List puts) perform batch insert?

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hi,

Re HBaseHUT - Alex didn't mention it, but he did a really nice and clear writeup 
of it in this post:
http://blog.sematext.com/2010/12/16/deferring-processing-updates-to-increase-hbase-write-performance/


Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




Re: HTable.put(List puts) perform batch insert?

Posted by Alex Baranau <al...@gmail.com>.
Re HBaseHUT, J-D was correct: you will gain speed with it in case you need
a Get & Put operation to perform your updates.

Don't forget to play with the writeToWAL and writeBuffer (with
autoFlush=false) attributes!
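
A minimal sketch of those knobs (0.90-era API; the table/family names and
the buffer size are just examples):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

static void tunedWrite() throws IOException {
  HTable htable = new HTable(HBaseConfiguration.create(), "mytable");
  htable.setAutoFlush(false);                   // buffer puts client-side
  htable.setWriteBufferSize(1024 * 1024 * 10);  // auto-flush at 10MB

  Put put = new Put("row1".getBytes());
  put.add("f".getBytes(), "q".getBytes(), "v".getBytes());
  // Skipping the WAL is noticeably faster, but edits can be lost if a
  // region server crashes before the next memstore flush.
  put.setWriteToWAL(false);
  htable.put(put);

  htable.flushCommits();  // don't forget the final flush
}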

Alex Baranau
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Mon, Jan 10, 2011 at 10:45 PM, Weishung Chung <we...@gmail.com> wrote:

> Ok, i will test it, thanks again :)

Re: HTable.put(List puts) perform batch insert?

Posted by Weishung Chung <we...@gmail.com>.
Ok, i will test it, thanks again :)

On Mon, Jan 10, 2011 at 1:53 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> Depending on the level of super fastness you need, it may or may not
> be fast enough. Better to test it, as usual.
>
> J-D

Re: HTable.put(List puts) perform batch insert?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Depending on the level of super fastness you need, it may or may not
be fast enough. Better to test it, as usual.

J-D

On Mon, Jan 10, 2011 at 11:12 AM, Weishung Chung <we...@gmail.com> wrote:
> Multiple batches of 10k *new/updated* rows at any time to different tables
> by different clients simultaneously. I want these multiple batches of
> insertions to be done super fast. At the same time, I would like to be able
> to scale up to 100k rows at a time (the goal).  Now, I am building a cluster
> of size 6 to 7 nodes.

Re: HTable.put(List puts) perform batch insert?

Posted by Stack <sa...@gmail.com>.
It would be interesting to hear about your experience with the asynchronous HBase client (it is used extensively at SU, where a few of us HBase committers work).

Stack




Re: HTable.put(List puts) perform batch insert?

Posted by tsuna <ts...@gmail.com>.
On Mon, Jan 10, 2011 at 11:12 AM, Weishung Chung <we...@gmail.com> wrote:
> Multiple batches of 10k *new/updated* rows at any time to different tables
> by different clients simultaneously. I want these multiple batches of
> insertions to be done super fast. At the same time, I would like to be able
> to scale up to 100k rows at a time (the goal).  Now, I am building a cluster
> of size 6 to 7 nodes.

If you're writing a multi-threaded client and you're going to have
many clients like this writing to HBase continuously, I recommend
writing your application with asynchbase
(http://github.com/stumbleupon/asynchbase) instead.  It's an alternate
HBase client library I wrote and in my application it significantly
increased write throughput.  It can easily push 150k updates per
second to a 20-node cluster – and then it's the local machine that's
CPU bound, not the HBase cluster (the local machine is a very slow VM
so it doesn't have a lot of horsepower).  This client is especially
good for throughput-oriented workloads and was written to be
thread-safe from the ground up (unlike HTable).
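
In sketch form (table/key names made up, API from memory):

import org.hbase.async.HBaseClient;
import org.hbase.async.PutRequest;

public class BulkWriter {
  public static void main(final String[] args) throws Exception {
    final HBaseClient client = new HBaseClient("zkquorum:2181");
    final byte[] table = "mytable".getBytes();
    final byte[] family = "f".getBytes();
    final byte[] qualifier = "q".getBytes();

    for (int i = 0; i < 100000; i++) {
      final byte[] key = ("row-" + i).getBytes();
      final byte[] value = ("value-" + i).getBytes();
      // put() returns immediately; edits are buffered and sent in
      // batches, grouped per region server.
      client.put(new PutRequest(table, key, family, qualifier, value));
    }

    // shutdown() flushes everything still buffered and waits for completion.
    client.shutdown().joinUninterruptibly();
  }
}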

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

Re: HTable.put(List puts) perform batch insert?

Posted by Weishung Chung <we...@gmail.com>.
Multiple batches of 10k *new/updated* rows at any time to different tables
by different clients simultaneously. I want these multiple batches of
insertions to be done super fast. At the same time, I would like to be able
to scale up to 100k rows at a time (the goal).  Now, I am building a cluster
of size 6 to 7 nodes.

On Mon, Jan 10, 2011 at 1:03 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> lotsa rows? That's 1k or 1B? Inside an OLTP system or OLAP?
>
> J-D

Re: HTable.put(List puts) perform batch insert?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
lotsa rows? That's 1k or 1B? Inside an OLTP system or OLAP?

J-D

On Mon, Jan 10, 2011 at 10:58 AM, Weishung Chung <we...@gmail.com> wrote:
> Jonathan, awesome, best-of-breed APIs!
> Jean, I would like to insert lotsa new rows with many columns in a
> particular column family programmatically in batch, just like the JDBC
> addBatch method.
> Thanks again.
>
>

Re: HTable.put(List puts) perform batch insert?

Posted by Weishung Chung <we...@gmail.com>.
Jonathan, awesome, best-of-breed APIs!
Jean, I would like to insert lotsa new rows with many columns in a
particular column family programmatically in batch, just like the JDBC
addBatch method.
Thanks again.


On Mon, Jan 10, 2011 at 12:44 PM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> BatchUpdate is deprecated and gone after 0.20; also, the name was
> misleading because it was batching edits on multiple columns but not
> rows.
>
> If I'm guessing correctly, you want to do an initial import of your
> data? The brute-force way is to write an MR job, but I would first
> recommend that you look into using the bulk uploader tools such as
> http://hbase.apache.org/docs/r0.89.20100924/bulk-loads.html
>
> J-D

Re: HTable.put(List puts) perform batch insert?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
BatchUpdate is deprecated and gone after 0.20; also, the name was
misleading because it was batching edits on multiple columns but not
rows.

If I'm guessing correctly, you want to do an initial import of your
data? The brute-force way is to write an MR job, but I would first
recommend that you look into using the bulk uploader tools such as
http://hbase.apache.org/docs/r0.89.20100924/bulk-loads.html

J-D

On Mon, Jan 10, 2011 at 10:10 AM, Weishung Chung <we...@gmail.com> wrote:
> Thank you :)
> Could I use org.apache.hadoop.hbase.io.BatchUpdate? Would it be faster than
> the put(List<Put>)?
> Also, would you recommend the use of MapReduce to accomplish the same thing?

Re: HTable.put(List puts) perform batch insert?

Posted by Weishung Chung <we...@gmail.com>.
Thank you :)
Could I use org.apache.hadoop.hbase.io.BatchUpdate? Would it be faster than
the put(List<Put>)?
Also, would you recommend the use of MapReduce to accomplish the same thing?

On Mon, Jan 10, 2011 at 11:38 AM, Jean-Daniel Cryans <jd...@apache.org> wrote:

> HBaseHUT is used to solve the Get+Put problem, so if it's your problem
> as well then do look into it.
>
> To answer your first question, that method will group Puts by region
> server, meaning that it will do anywhere between 1 and n RPCs, where n
> is the number of RSs, and that's done in parallel.
>
> J-D

Re: HTable.put(List puts) perform batch insert?

Posted by Jean-Daniel Cryans <jd...@apache.org>.
HBaseHUT is used to solve the Get+Put problem, so if it's your problem
as well then do look into it.

To answer your first question, that method will group Puts by region
server, meaning that it will do anywhere between 1 and n RPCs, where n
is the number of RSs, and that's done in parallel.
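
In other words, one client-side call can fan out into several parallel
RPCs. A quick sketch (names made up):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;

static void batchedInsert() throws IOException {
  HTable htable = new HTable(HBaseConfiguration.create(), "mytable");
  List<Put> puts = new ArrayList<Put>();
  for (int i = 0; i < 10000; i++) {
    Put put = new Put(("row-" + i).getBytes());
    put.add("f".getBytes(), "q".getBytes(), ("value-" + i).getBytes());
    puts.add(put);
  }
  // One call: the client groups these by region server and sends one
  // RPC per server, in parallel.
  htable.put(puts);
  htable.flushCommits();  // needed if autoFlush was turned off
}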

J-D

On Mon, Jan 10, 2011 at 9:06 AM, Weishung Chung <we...@gmail.com> wrote:
> What is the difference between the above put method with the following
> capability of the HBaseHUT package ?
> https://github.com/sematext/HBaseHUT
>
> On Mon, Jan 10, 2011 at 10:58 AM, Weishung Chung <we...@gmail.com> wrote:
>
>> Does HTable.put(List<Put> puts) method perform a batch insert with a single
>> RPC call? I am going to insert a lot of values into a column family and
>> would like to increase the write speed.
>> Thank you.
>>
>

Re: HTable.put(List puts) perform batch insert?

Posted by Weishung Chung <we...@gmail.com>.
What is the difference between the above put method with the following
capability of the HBaseHUT package ?
https://github.com/sematext/HBaseHUT

On Mon, Jan 10, 2011 at 10:58 AM, Weishung Chung <we...@gmail.com> wrote:

> Does HTable.put(List<Put> puts) method perform a batch insert with a single
> RPC call? I am going to insert a lot of values into a column family and
> would like to increase the write speed.
> Thank you.
>