Posted to common-dev@hadoop.apache.org by "Jim Kellerman (JIRA)" <ji...@apache.org> on 2007/07/16 09:07:14 UTC

[jira] Updated: (HADOOP-1468) Add HBase batch update to reduce RPC overhead

     [ https://issues.apache.org/jira/browse/HADOOP-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman updated HADOOP-1468:
----------------------------------

    Attachment: patch.txt

Works in my environment. Ensure Hudson agrees.

> Add HBase batch update to reduce RPC overhead
> ---------------------------------------------
>
>                 Key: HADOOP-1468
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1468
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: contrib/hbase
>    Affects Versions: 0.14.0
>            Reporter: Jim Kellerman
>            Assignee: Jim Kellerman
>             Fix For: 0.14.0
>
>         Attachments: patch.txt
>
>
> On Wed, 2007-06-06 at 10:05 -0700, James Kennedy wrote:
> > Hi,
> > 
> > I'm noticing that since the HClient/HRegionServer interface only allows 
> > for a per-column put(), there is a lot of RPC and some lease management 
> > overhead when writing large amounts of data. For example:
> > 
> >         for (int i = 0; i < 10000; i++) {
> >             Text rowKey = new Text(i+"");
> >             long lock = client.startUpdate(rowKey);
> >             client.put(lock, COL1, rowKey.getBytes());
> >             client.put(lock, COL2, someValue.getBytes());
> >             client.commit(lock);
> >         }
> > 
> > This code takes my machine (using a single HMaster/HRegionServer on 
> > local filesystem) approximately 13 seconds to execute. When I measure 
> > the execution time within HRegionServer.put() I get total time spent in 
> > put() < 2 seconds. So it looks like there's definitely overhead in the 
> > RPC communication and serialization/deserialization between client and 
> > server. 
> > 
> > To write 10000 rows, 10000 x (startUpdate=1 +  #cols=2 + commit=1) = 
> > 40000 RPC operations.
> > 
> > What I'm thinking, and please tell me if I'm wrong or if this is already 
> > in the works, is that if I create a row-level put() method that submits 
> > a map of column values at once, I would reduce the 2 + (#cols) RPC 
> > operations to one single atomic row-write RPC as well as eliminate the 
> > small but noticeable overhead in lease creation, renewal, and cancellation.
> > 
> > It's not clear exactly what the performance improvement would be. The 
> > same amount of serialization/deserialization must occur, but YourKit 
> > profiling tells me that the serialization overhead is negligible.
> > 
> > Any thoughts?
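
To make the RPC arithmetic in the quoted proposal concrete, here is a minimal
sketch in Java. It is not the attached patch and it does not use the real
HClient/HRegionServer classes: CountingServer, batchPut, the "colN:" column
names, and the in-memory maps are all hypothetical names made up for
illustration. The sketch only counts simulated round trips for the per-column
sequence (startUpdate, one put per column, commit) against a single row-level
call that carries a map of column values.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class BatchUpdateSketch {

    /** Hypothetical in-memory stand-in for a region server; each method models one RPC. */
    static class CountingServer {
        int rpcCalls = 0;
        private long nextLock = 0;
        private final Map<Long, String> lockToRow = new HashMap<>();
        private final Map<Long, Map<String, byte[]>> pending = new HashMap<>();
        final Map<String, Map<String, byte[]>> rows = new HashMap<>();

        long startUpdate(String row) {
            rpcCalls++;
            long lock = nextLock++;
            lockToRow.put(lock, row);
            pending.put(lock, new LinkedHashMap<>());
            return lock;
        }

        void put(long lock, String column, byte[] value) {
            rpcCalls++;
            // Buffer the value until commit; assumes the lock is valid.
            pending.get(lock).put(column, value);
        }

        void commit(long lock) {
            rpcCalls++;
            String row = lockToRow.remove(lock);
            rows.computeIfAbsent(row, k -> new LinkedHashMap<>()).putAll(pending.remove(lock));
        }

        /** Hypothetical row-level write: all column values travel in one call. */
        void batchPut(String row, Map<String, byte[]> columns) {
            rpcCalls++;
            rows.computeIfAbsent(row, k -> new LinkedHashMap<>()).putAll(columns);
        }
    }

    /** Per-column pattern from the quoted example: (1 + colCount + 1) calls per row. */
    public static int perColumnRpcs(int rowCount, int colCount) {
        CountingServer server = new CountingServer();
        for (int i = 0; i < rowCount; i++) {
            String rowKey = Integer.toString(i);
            long lock = server.startUpdate(rowKey);
            for (int c = 0; c < colCount; c++) {
                server.put(lock, "col" + c + ":", rowKey.getBytes(StandardCharsets.UTF_8));
            }
            server.commit(lock);
        }
        return server.rpcCalls;
    }

    /** Batched pattern: one call per row, regardless of column count. */
    public static int batchedRpcs(int rowCount, int colCount) {
        CountingServer server = new CountingServer();
        for (int i = 0; i < rowCount; i++) {
            String rowKey = Integer.toString(i);
            Map<String, byte[]> columns = new LinkedHashMap<>();
            for (int c = 0; c < colCount; c++) {
                columns.put("col" + c + ":", rowKey.getBytes(StandardCharsets.UTF_8));
            }
            server.batchPut(rowKey, columns);
        }
        return server.rpcCalls;
    }

    public static void main(String[] args) {
        System.out.println("per-column RPCs: " + perColumnRpcs(10000, 2)); // 40000
        System.out.println("batched RPCs:    " + batchedRpcs(10000, 2));   // 10000
    }
}
```

With 10000 rows and 2 columns, the per-column loop issues 10000 x (1 + 2 + 1) =
40000 simulated calls, matching the figure in the description, while the
batched variant issues 10000. The sketch says nothing about wall-clock time or
lease handling; it only shows where the call count comes from.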

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.