Posted to common-dev@hadoop.apache.org by "Jim Kellerman (JIRA)" <ji...@apache.org> on 2007/07/10 19:22:05 UTC
[jira] Work started: (HADOOP-1468) Add HBase batch update to reduce RPC overhead
[ https://issues.apache.org/jira/browse/HADOOP-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on HADOOP-1468 started by Jim Kellerman.
> Add HBase batch update to reduce RPC overhead
> ---------------------------------------------
>
> Key: HADOOP-1468
> URL: https://issues.apache.org/jira/browse/HADOOP-1468
> Project: Hadoop
> Issue Type: New Feature
> Components: contrib/hbase
> Affects Versions: 0.14.0
> Reporter: Jim Kellerman
> Assignee: Jim Kellerman
> Fix For: 0.14.0
>
>
> On Wed, 2007-06-06 at 10:05 -0700, James Kennedy wrote:
> > Hi,
> >
> > I'm noticing that since the HClient/HRegionServer interface only allows
> > for a per-column put(), there is a lot of RPC and some lease management
> > overhead when writing large amounts of data. For example:
> >
> >   for (int i = 0; i < 10000; i++) {
> >     Text rowKey = new Text(i+"");
> >     long lock = client.startUpdate(rowKey);
> >     client.put(lock, COL1, rowKey.getBytes());
> >     client.put(lock, COL2, someValue.getBytes());
> >     client.commit(lock);
> >   }
> >
> > This code takes my machine (using a single HMaster/HRegionServer on
> > local filesystem) approximately 13 seconds to execute. When I measure
> > the execution time within HRegionServer.put() I get total time spent in
> > put() < 2 seconds. So it looks like there's definitely overhead in the
> > RPC communication and serialization/deserialization between client and
> > server.
> >
> > To write 10000 rows, 10000 x (startUpdate=1 + #cols=2 + commit=1) =
> > 40000 RPC operations.
> >
> > What I'm thinking, and please tell me if I'm wrong or if this is already
> > in the works, is that if I create a row-level put() method that submits
> > a map of column values at once, I would reduce the 2 + (#cols) RPC
> > operations to one single atomic row-write RPC as well as eliminate the
> > small but noticeable overhead in lease creation, renewal, and cancellation.
> >
> > It's not clear exactly what the performance improvement would be. The
> > same amount of serialization/deserialization must occur, but YourKit
> > profiling tells me that the serialization overhead is negligible.
> >
> > Any thoughts?
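
The RPC arithmetic and the row-level put() idea in the quoted message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the HBase client API: BatchUpdateSketch, CountingServer, batchPut(), and the column names "col1:" and "col2:" are all hypothetical, and the "server" is an in-memory stand-in that counts calls instead of performing real RPC.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names belong to the real HBase API. */
class BatchUpdateSketch {

    /** Per-column protocol: startUpdate + one put per column + commit, for every row. */
    static long perColumnRpcs(long rows, int colsPerRow) {
        return rows * (1 + colsPerRow + 1);
    }

    /** Row-level protocol: one call carries every column of a row. */
    static long batchedRpcs(long rows) {
        return rows;
    }

    /** In-memory stand-in for a region server that counts the calls it receives. */
    static class CountingServer {
        long rpcCount = 0;
        final Map<String, Map<String, byte[]>> rows = new LinkedHashMap<>();

        /** Assumed row-level write: all column values for one row arrive together. */
        void batchPut(String rowKey, Map<String, byte[]> columns) {
            rpcCount++;
            // Copy the map so later changes by the caller do not affect stored data.
            rows.put(rowKey, new LinkedHashMap<>(columns));
        }
    }

    /** Writes rowCount rows of two columns each, using one call per row. */
    static CountingServer writeRows(int rowCount) {
        CountingServer server = new CountingServer();
        for (int i = 0; i < rowCount; i++) {
            String rowKey = Integer.toString(i);
            Map<String, byte[]> columns = new LinkedHashMap<>();
            columns.put("col1:", rowKey.getBytes(StandardCharsets.UTF_8));
            columns.put("col2:", "someValue".getBytes(StandardCharsets.UTF_8));
            server.batchPut(rowKey, columns);
        }
        return server;
    }

    public static void main(String[] args) {
        System.out.println("per-column RPCs: " + perColumnRpcs(10000, 2));
        System.out.println("batched RPCs:    " + batchedRpcs(10000));
        System.out.println("calls counted:   " + writeRows(10000).rpcCount);
    }
}
```

Under these assumptions, the call count drops from rows x (2 + #cols) to one call per row. The sketch says nothing about the actual time saved, which depends on the per-call RPC and lease overhead measured on a real cluster.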
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.