Posted to dev@hbase.apache.org by "Clint Morgan (JIRA)" <ji...@apache.org> on 2008/04/29 23:49:55 UTC

[jira] Created: (HBASE-605) allow scanners which return results ordered by a column value

allow scanners which return results ordered by a column value
-------------------------------------------------------------

                 Key: HBASE-605
                 URL: https://issues.apache.org/jira/browse/HBASE-605
             Project: Hadoop HBase
          Issue Type: New Feature
          Components: client, regionserver
    Affects Versions: 0.2.0
            Reporter: Clint Morgan
            Priority: Minor


We would like to be able to scan through tables with results ordered by (deserialized) column values. This approach maintains an in-memory sorted set for each ordered-by column in each HStore. This allows us to iterate through the keys in column order, and to do random reads on the key to get the full row.

Without the index, we have to scan through all the rows to get the first result ordered by a column. Thus, when R is the number of rows in a table, N is the number of ordered-by rows we want, and R >> N, we can save a lot of work by not doing the full table scan.
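
A minimal sketch of the idea above, for illustration only. The class name ColumnOrderIndex, its methods, and the use of long column values and String row keys are assumptions made for this example and are not taken from the attached patch; a real index would also have to drop a row's old entry when its column value changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical illustration only; not the class from the attached patch.
class ColumnOrderIndex {
    // Deserialized column value -> row keys holding that value, kept sorted.
    private final TreeMap<Long, TreeSet<String>> byValue =
        new TreeMap<Long, TreeSet<String>>();

    // Record that the given row currently holds the given column value.
    void put(String rowKey, long columnValue) {
        TreeSet<String> rows = byValue.get(columnValue);
        if (rows == null) {
            rows = new TreeSet<String>();
            byValue.put(columnValue, rows);
        }
        rows.add(rowKey);
    }

    // Return the first n row keys in column-value order.
    // The walk stops after n keys, so it does not visit all R rows.
    List<String> firstN(int n) {
        List<String> out = new ArrayList<String>();
        for (Map.Entry<Long, TreeSet<String>> e : byValue.entrySet()) {
            for (String rowKey : e.getValue()) {
                if (out.size() == n) {
                    return out;
                }
                out.add(rowKey);
            }
        }
        return out;
    }
}
```

Each returned row key would then be used for a random read to fetch the full row.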



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HBASE-605) allow scanners which return results ordered by a column value

Posted by "Clint Morgan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599192#action_12599192 ] 

Clint Morgan commented on HBASE-605:
------------------------------------

Nope. A single SortedSet is created/maintained per order-able column. Then we'd like to iterate through it forwards or backwards to respond to the client's scanner request.

I suppose we could maintain two such SortedSets (forwards and backwards) by inverting the Comparator, but this seems a waste of space and time....
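
The two options can be sketched roughly as follows. The class and method names are hypothetical and Integer stands in for the deserialized column value; option 1 relies on NavigableSet, which needs Java 6.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only; not code from the patch.
class BothDirections {
    // Option 1 (Java 6+): keep one set and walk it backwards on demand.
    static List<Integer> backwards(NavigableSet<Integer> values) {
        List<Integer> out = new ArrayList<Integer>();
        Iterator<Integer> it = values.descendingIterator();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    // Option 2 (Java 5): build a second set with an inverted Comparator.
    // This copies every element, and every later write must update both sets.
    static SortedSet<Integer> reversedCopy(SortedSet<Integer> values) {
        SortedSet<Integer> reversed =
            new TreeSet<Integer>(Collections.<Integer>reverseOrder());
        reversed.addAll(values);
        return reversed;
    }
}
```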


[jira] Updated: (HBASE-605) allow scanners which return results ordered by a column value

Posted by "Clint Morgan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Clint Morgan updated HBASE-605:
-------------------------------

    Attachment: hbase-605-v3.patch

responded to comments:

high level overview in javadoc for OrderedRegionServer

moved stuff into own packages in client.ordered and regionserver.ordered




[jira] Commented: (HBASE-605) allow scanners which return results ordered by a column value

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598057#action_12598057 ] 

stack commented on HBASE-605:
-----------------------------

This patch has much merit if only for the fact that it verifies (after making a few tweaks) that a subclass of HRegionServer is possible.

+ Source lines are kept < 80 columns wide in hadoop
+ Should you be subclassing HColumnDescriptor too?  Should it be versioned too?
+ Should we instead add accessors to HRS for the data members you changed from private to protected? (leases and requestCount)
+ We need to make HRegion subclassable or at least be configurable about which HStore to use?  HTable too (As is, they are 'polluted' with your sorted column code)




[jira] Updated: (HBASE-605) allow scanners which return results ordered by a column value

Posted by "Jim Kellerman (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jim Kellerman updated HBASE-605:
--------------------------------

    Fix Version/s:     (was: 0.18.0)
                   0.19.0


[jira] Commented: (HBASE-605) allow scanners which return results ordered by a column value

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12598902#action_12598902 ] 

stack commented on HBASE-605:
-----------------------------

In TestOrderedScanner, do you want to remove commented out code?  Do you want to add a class comment that says this test depends on/uses a 'special' version of HRegionServer?  Would suggest that all classes that depend on this custom HRegionServer also get marked appropriately in their class comment (@see?): e.g. OrderedScanner won't work unless it's going against the ordered HRS -- same for OrderedHRegion.

Some classes are missing licenses.

I suppose package protection prevents you putting all these new classes into a new orderedregionserver package or into a subpackage named regionserver.ordered and client.ordered or some such?

You need to explain somewhere in javadoc what this OrderedRegionServer is, how it works, and how to enable it.  Would suggest the class comment in the OrderedRegionServer or in the Ordered Interface as good places (otherwise, should I put in place a package.html to which you can add?).  What would be great is that the next time someone shows up asking how they can customize regionserver behavior, we can just point them to your OrderedRegionServer javadoc as an example.

Thanks for adding accessors rather than making data members protected in RegionServer and for making HStore, etc., subclassable.

Otherwise, the patch looks great.



[jira] Updated: (HBASE-605) allow scanners which return results ordered by a column value

Posted by "Clint Morgan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Clint Morgan updated HBASE-605:
-------------------------------

    Attachment: hbase-605.patch

This patch contains a minimal implementation and a small unit test.

One known deficiency is that the sorted set is built twice per hregion upon splitting. This is due to hregions being opened then immediately closed upon a split.

I've tested a bit more in our layers above hbase and it works for me (so far). 


[jira] Commented: (HBASE-605) allow scanners which return results ordered by a column value

Posted by "Bryan Duxbury (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599188#action_12599188 ] 

Bryan Duxbury commented on HBASE-605:
-------------------------------------

You can't supply a Comparator to SortedSet that you code to act in reverse?


[jira] Updated: (HBASE-605) allow scanners which return results ordered by a column value

Posted by "Clint Morgan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Clint Morgan updated HBASE-605:
-------------------------------

    Resolution: Won't Fix
        Status: Resolved  (was: Patch Available)

Resolving this issue, as I've decided to go with the table indexed approach of HBASE-883.


[jira] Updated: (HBASE-605) allow scanners which return results ordered by a column value

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-605:
------------------------

    Fix Version/s: 0.3.0

OK Clint.  I moved this to 0.3 hbase for now.


[jira] Commented: (HBASE-605) allow scanners which return results ordered by a column value

Posted by "stack (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599164#action_12599164 ] 

stack commented on HBASE-605:
-----------------------------

Unfortunately, we're still java5.  We probably won't go to java6 as a requirement until hbase 0.3, to match hadoop 0.18.  Please purge the java6isms (NavigableSet in SortedColumn).

Also, I get this compiling:

{code}
    [javac] /Users/stack/Documents/checkouts/trunk/src/java/org/apache/hadoop/hbase/LocalHBaseCluster.java:121: cannot find symbol
    [javac] symbol  : constructor IOException(java.lang.Exception)
    [javac] location: class java.io.IOException
    [javac]         throw new IOException(e);
{code}

Do you?

Thanks Clint (I already updated the FAQ to point to OrderedRegionServer as an example of modifying HRegionServer behavior).
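
For reference, the IOException(Throwable) constructor named in the compile error above was only added in Java 6. A Java 5-compatible way to wrap the exception might look like the sketch below; the helper class and method names are hypothetical and are not part of the patch or of LocalHBaseCluster.

```java
import java.io.IOException;

// Hypothetical helper, shown only to illustrate the Java 5 idiom.
class IOExceptions {
    // Build the IOException from a message, then attach the cause with
    // initCause, which is available on Throwable since Java 1.4.
    static IOException wrap(Exception e) {
        IOException ioe = new IOException(e.toString());
        ioe.initCause(e);
        return ioe;
    }
}
```

The failing line would then read `throw IOExceptions.wrap(e);` instead of `throw new IOException(e);`.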


[jira] Updated: (HBASE-605) allow scanners which return results ordered by a column value

Posted by "stack (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-605:
------------------------

    Status: Patch Available  (was: Open)

Mark this as patch available so it gets a review


[jira] Commented: (HBASE-605) allow scanners which return results ordered by a column value

Posted by "Clint Morgan (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HBASE-605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12599181#action_12599181 ] 

Clint Morgan commented on HBASE-605:
------------------------------------

Unfortunately I need the NavigableSet to get a descending iterator. SortedSet does not provide this functionality. Could use some 3rd party data structure, but ...

I don't get the IOException compile error; that's a java6 change too...

For the time being, I'm inclined to just leave this as a java6 patch and wait until java6 adoption to apply it. Works for me now, and I need to spend time on other things.
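
For completeness, a plain SortedSet can still be walked backwards under Java 5 by repeatedly taking last() of a shrinking headSet view. This is only a sketch of that workaround, with hypothetical names, and not what the patch does; each step costs a tree lookup rather than a simple iterator advance, so it is slower than NavigableSet's descending iterator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;

// Hypothetical illustration of a Java 5-only backwards walk.
class Java5Descending {
    // Collect the elements of the set from largest to smallest.
    // The original set is not modified; only views of it are taken.
    static <T> List<T> descending(SortedSet<T> set) {
        List<T> out = new ArrayList<T>();
        SortedSet<T> remaining = set;
        while (!remaining.isEmpty()) {
            T last = remaining.last();
            out.add(last);
            // headSet is exclusive, so this view drops the element just taken.
            remaining = remaining.headSet(last);
        }
        return out;
    }
}
```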



[jira] Updated: (HBASE-605) allow scanners which return results ordered by a column value

Posted by "Clint Morgan (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/HBASE-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Clint Morgan updated HBASE-605:
-------------------------------

    Attachment: hbase-605-v2.patch

updated to trunk and responded to comments. Subclassing minimizes pollution of core code...
