You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hbase.apache.org by "stack (JIRA)" <ji...@apache.org> on 2008/04/02 19:55:24 UTC
[jira] Commented: (HBASE-521) Improve client scanner interface

    [ https://issues.apache.org/jira/browse/HBASE-521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12584697#action_12584697 ] 

stack commented on HBASE-521:
-----------------------------

Comments on v5

Your entry in CHANGES.txt belongs in the INCOMPATIBLE section

You were going to deprecate obtainScanner and make a getScanner instead in HTable?

Can you leave one of these in place?

{code}
-      LOG.debug("Advancing internal scanner to startKey " + localStartKey);^M
       this.currentRegionLocation = getRegionLocation(localStartKey);^M
-      ^M
-      LOG.debug("New region: " + this.currentRegionLocation);^
{code}

I know they look useful but I've found them useful debugging trying to figure problematic region.... WIth them in place I know the last or current problematic region when scanning.

Can you write the below...

{code}
       if (values != null && values.size() != 0) {^M
-        for (Map.Entry<Text, Cell> e: values.entrySet()) {^M
-          // HStoreKey k = (HStoreKey) e.getKey();^M
-          key.setRow(values.getRow());^M
-          key.setVersion(e.getValue().getTimestamp());^M
-          key.setColumn(EMPTY_COLUMN);^M
-          results.put(e.getKey(), e.getValue().getValue());^M
-        }^M
+        return values;^M
       }^M
-      return values == null ? false : values.size() != 0;^M
+      ^M
+      return null;^M
{code}

instead as just

{code}
return values;
{code}

Is this used?

{code}
+public class IdentityTableReduce extends TableReduce<Text, BatchUpdate> {
+  private static final Log LOG =
+    LogFactory.getLog(IdentityTableReduce.class.getName());
{code}

For long in reduce, how about <regionid> ':' <rowindex_zeropadded> or jenkins hash of start row plus an index?  It irks me that we're duplicating info -- row and BatchUpdate.  Same irritiation about TableOutputFormat.

I was going to argue that minimally, a hash of the row would be easier on the sort but then hash will probably not sort same as key which would probably be a problem in that any one reducer would be inserting all over the table row-namespace -- probably not what we want.

This is great in TableOutputFormat

{code}
-    public void write(Text key, MapWritable value) throws IOException {
-      long xid = m_table.startUpdate(key);              // start transaction
-
-      for (Map.Entry<Writable, Writable> e: value.entrySet()) {
-        m_table.put(xid, (Text)e.getKey(),
-            ((ImmutableBytesWritable)e.getValue()).get());
-      }
-      m_table.commit(xid);                              // end transaction
+    public void write(Text key, BatchUpdate value) throws IOException {
+      m_table.commit(value);
     }
{code}

You were going to change the name of HInternalScannerInterface



> Improve client scanner interface
> --------------------------------
>
>                 Key: HBASE-521
>                 URL: https://issues.apache.org/jira/browse/HBASE-521
>             Project: Hadoop HBase
>          Issue Type: Improvement
>          Components: client
>            Reporter: Bryan Duxbury
>            Assignee: Bryan Duxbury
>            Priority: Minor
>             Fix For: 0.2.0
>
>         Attachments: 521-v2.patch, 521-v3.patch, 521-v4.patch, 521-v5.patch, 521.patch
>
>
> The current client scanner interface is pretty ugly. You need to instantiate an HStoreKey and SortedMap<Text, byte[]> externally and then pass them into next. This is pretty bad, because for starters, the client has to choose the implementation of the map when they create it, so it's extra brain cycles to figure that out. HStoreKey doesn't show up anywhere else in the entire client side API, but here it bubbles out of next as a way to get the row and presumably the timestamp of the columns.
> I propose that we supplant HScannerInterface with Scanner, an easier-to-use version for clients. Its next method would look something like:
> {code}
> public RowResult next() throws IOException;
> {code}
> This packs the data up much more cleanly, including using Cells as values instead of raw byte[], meaning you have much more granular timestamp information. You also don't need HStoreKey anymore.
> By breaking Scanner away from HScannerInterface, we can leave the internal scanning code completely alone (keep using HStoreKeys and such) but make the client cleaner.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.