Posted to common-dev@hadoop.apache.org by "Jim Kellerman (JIRA)" <ji...@apache.org> on 2007/07/10 19:27:05 UTC

[jira] Commented: (HADOOP-1468) Add HBase batch update to reduce RPC overhead

    [ https://issues.apache.org/jira/browse/HADOOP-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12511500 ] 

Jim Kellerman commented on HADOOP-1468:
---------------------------------------

> Another possibility is batch update of multiple rows wherein the client buffers up a number
> of row updates and flushes them out together.

This may still require multiple RPCs if the rows being updated are in different regions and are being served by different servers. However, the client could split the batch into per-server chunks, which would still greatly reduce the number of RPCs.
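
To make the per-server chunking concrete, here is a minimal sketch of how a
client-side write buffer could be grouped by server before flushing, so that
each flush costs one RPC per server rather than several RPCs per row. The
RowUpdate class, the locateServer() lookup, and the server names below are
hypothetical stand-ins for illustration, not part of the current HClient API.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class BatchSplitter {

      /** Hypothetical buffered update for a single row. */
      static class RowUpdate {
        final String row;
        final Map<String, byte[]> columns = new HashMap<String, byte[]>();
        RowUpdate(String row) { this.row = row; }
      }

      /** Stub: look up which server hosts the region containing this row. */
      static String locateServer(String row) {
        // A real client would consult its cached region locations here.
        return "server-" + (Math.abs(row.hashCode()) % 2);
      }

      /**
       * Split a buffered batch into per-server chunks so that each chunk
       * can be flushed to its server in a single RPC.
       */
      static Map<String, List<RowUpdate>> splitByServer(List<RowUpdate> batch) {
        Map<String, List<RowUpdate>> chunks = new HashMap<String, List<RowUpdate>>();
        for (RowUpdate update : batch) {
          String server = locateServer(update.row);
          List<RowUpdate> chunk = chunks.get(server);
          if (chunk == null) {
            chunk = new ArrayList<RowUpdate>();
            chunks.put(server, chunk);
          }
          chunk.add(update);
        }
        return chunks;
      }

      public static void main(String[] args) {
        List<RowUpdate> batch = new ArrayList<RowUpdate>();
        for (int i = 0; i < 10; i++) {
          RowUpdate update = new RowUpdate("row" + i);
          update.columns.put("colfamily1:a", ("value" + i).getBytes());
          batch.add(update);
        }
        // One flush RPC per server instead of several RPCs per row.
        for (Map.Entry<String, List<RowUpdate>> entry : splitByServer(batch).entrySet()) {
          System.out.println(entry.getKey() + " -> " + entry.getValue().size() + " row updates");
        }
      }
    }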


> Add HBase batch update to reduce RPC overhead
> ---------------------------------------------
>
>                 Key: HADOOP-1468
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1468
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>    Affects Versions: 0.14.0
>            Reporter: Jim Kellerman
>            Assignee: Jim Kellerman
>             Fix For: 0.14.0
>
>
> On Wed, 2007-06-06 at 10:05 -0700, James Kennedy wrote:
> > Hi,
> > 
> > I'm noticing that since the HClient/HRegionServer interface only allows 
> > for a per-column put(), there is a lot of RPC and some lease management 
> > overhead when writing large amounts of data. For example:
> > 
> >         // One row lock/lease and one RPC per call: startUpdate,
> >         // a put() per column, and commit -- four RPCs per row here.
> >         for (int i = 0; i < 10000; i++) {
> >             Text rowKey = new Text(i + "");
> >             long lock = client.startUpdate(rowKey);        // obtain row lock/lease
> >             client.put(lock, COL1, rowKey.getBytes());
> >             client.put(lock, COL2, someValue.getBytes());
> >             client.commit(lock);                           // commit and release lease
> >         }
> > 
> > This code takes approximately 13 seconds to execute on my machine (a 
> > single HMaster/HRegionServer on the local filesystem). When I measure 
> > the execution time within HRegionServer.put(), the total time spent in 
> > put() is under 2 seconds. So it looks like there's definitely overhead in 
> > the RPC communication and serialization/deserialization between client 
> > and server. 
> > 
> > To write 10000 rows, that works out to 10000 x (startUpdate=1 + #cols=2 
> > + commit=1) = 40000 RPC operations.
> > 
> > What I'm thinking, and please tell me if I'm wrong or if this is already 
> > in the works, is that if I create a row-level put() method that submits 
> > a map of column values at once, I would reduce the 2 + (#cols) RPC 
> > operations per row to a single atomic row-write RPC, as well as eliminate 
> > the small but noticeable overhead in lease creation, renewal, and cancellation.
> > 
> > It's not clear exactly what the performance improvement would be. The 
> > same amount of serialization/deserialization must occur, but YourKit 
> > profiling tells me that the serialization overhead is negligible.
> > 
> > Any thoughts?
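
For illustration, here is a rough sketch of the kind of row-level interface
described above, with the original per-column loop rewritten against it. The
RowBatchClient interface, its putRow() signature, and the column names are
assumptions made for this sketch, not necessarily the API that will be committed.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.Text;

    // Hypothetical row-level interface: all column updates for a row travel
    // in one RPC instead of startUpdate + one put() per column + commit.
    interface RowBatchClient {
      void putRow(Text row, Map<Text, byte[]> columns) throws IOException;
    }

    public class BatchedWriter {
      public static void writeRows(RowBatchClient client) throws IOException {
        final Text COL1 = new Text("colfamily1:a");
        final Text COL2 = new Text("colfamily1:b");
        final byte[] someValue = "someValue".getBytes();

        for (int i = 0; i < 10000; i++) {
          Text rowKey = new Text(Integer.toString(i));
          Map<Text, byte[]> columns = new HashMap<Text, byte[]>();
          columns.put(COL1, rowKey.toString().getBytes());
          columns.put(COL2, someValue);
          // One RPC per row (10000 total instead of 40000), and no
          // client-visible lock or lease to create, renew, or cancel.
          client.putRow(rowKey, columns);
        }
      }
    }

Combined with the per-server chunking suggested in the comment above, the client
could also buffer many such rows and flush them in even fewer RPCs.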

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.